ETAPS Foreword 


Welcome to the 25th ETAPS! ETAPS 2022 took place in Munich, the beautiful capital 
of Bavaria, in Germany. 

ETAPS 2022 is the 25th instance of the European Joint Conferences on Theory and 
Practice of Software. ETAPS is an annual federated conference established in 1998, 
and consists of four conferences: ESOP, FASE, FoSSaCS, and TACAS. Each 
conference has its own Program Committee (PC) and its own Steering Committee 
(SC). The conferences cover various aspects of software systems, ranging from theo- 
retical computer science to foundations of programming languages, analysis tools, and 
formal approaches to software engineering. Organizing these conferences in a coherent, 
highly synchronized conference program enables researchers to participate in an 
exciting event, having the possibility to meet many colleagues working in different 
directions in the field, and to easily attend talks of different conferences. On the 
weekend before the main conference, numerous satellite workshops took place that 
attracted many researchers from all over the globe. 

ETAPS 2022 received 362 submissions in total, 111 of which were accepted, 
yielding an overall acceptance rate of 30.7%. I thank all the authors for their interest in 
ETAPS, all the reviewers for their reviewing efforts, the PC members for their con- 
tributions, and in particular the PC (co-)chairs for their hard work in running this entire 
intensive process. Last but not least, my congratulations to all authors of the accepted 
papers! 

ETAPS 2022 featured the unifying invited speakers Alexandra Silva (University 
College London, UK, and Cornell University, USA) and Tomas Vojnar (Brno 
University of Technology, Czech Republic) and the conference-specific invited 
speakers Nathalie Bertrand (Inria Rennes, France) for FoSSaCS and Lenore Zuck 
(University of Illinois at Chicago, USA) for TACAS. Invited tutorials were provided by 
Stacey Jeffery (CWI and QuSoft, The Netherlands) on quantum computing and 
Nicholas Lane (University of Cambridge and Samsung AI Lab, UK) on federated 
learning. 

As this event was the 25th edition of ETAPS, part of the program was a special 
celebration where we looked back on the achievements of ETAPS and its constituting 
conferences in the past, but we also looked into the future, and discussed the challenges 
ahead for research in software science. This edition also reinstated the ETAPS men- 
toring workshop for PhD students. 

ETAPS 2022 took place in Munich, Germany, and was organized jointly by the 
Technical University of Munich (TUM) and the LMU Munich. The former was 
founded in 1868, and the latter in 1472 as the 6th oldest German university still running 
today. Together, they have 100,000 enrolled students, regularly rank among the top 
100 universities worldwide (with TUM’s computer-science department ranked #1 in 
the European Union), and their researchers and alumni include 60 Nobel laureates. 
The local organization team consisted of Jan Křetínský (general chair), Dirk Beyer 
(general, financial, and workshop chair), Julia Eisentraut (organization chair), and 
Alexandros Evangelidis (local proceedings chair). 

ETAPS 2022 was further supported by the following associations and societies: 
ETAPS e.V., EATCS (European Association for Theoretical Computer Science), 
EAPLS (European Association for Programming Languages and Systems), and EASST 
(European Association of Software Science and Technology). 

The ETAPS Steering Committee consists of an Executive Board, and representa- 
tives of the individual ETAPS conferences, as well as representatives of EATCS, 
EAPLS, and EASST. The Executive Board consists of Holger Hermanns 
(Saarbrücken), Marieke Huisman (Twente, chair), Jan Kofron (Prague), Barbara König 
(Duisburg), Thomas Noll (Aachen), Caterina Urban (Paris), Tarmo Uustalu (Reykjavik 
and Tallinn), and Lenore Zuck (Chicago). 

Other members of the Steering Committee are Patricia Bouyer (Paris), Einar Broch 
Johnsen (Oslo), Dana Fisman (Be’er Sheva), Reiko Heckel (Leicester), Joost-Pieter 
Katoen (Aachen and Twente), Fabrice Kordon (Paris), Jan Křetínský (Munich), Orna 
Kupferman (Jerusalem), Leen Lambers (Cottbus), Tiziana Margaria (Limerick), 
Andrew M. Pitts (Cambridge), Elizabeth Polgreen (Edinburgh), Grigore Rosu (Illinois), 
Peter Ryan (Luxembourg), Sriram Sankaranarayanan (Boulder), Don Sannella 
(Edinburgh), Lutz Schröder (Erlangen), Ilya Sergey (Singapore), Natasha Sharygina 
(Lugano), Pawel Sobocinski (Tallinn), Peter Thiemann (Freiburg), Sebastian Uchitel 
(London and Buenos Aires), Jan Vitek (Prague), Andrzej Wasowski (Copenhagen), 
Thomas Wies (New York), Anton Wijs (Eindhoven), and Manuel Wimmer (Linz). 

I'd like to take this opportunity to thank all authors, attendees, organizers of the 
satellite workshops, and Springer-Verlag GmbH for their support. I hope you all 
enjoyed ETAPS 2022. 

Finally, a big thanks to Jan, Julia, Dirk, and their local organization team for all their 
enormous efforts to make ETAPS a fantastic event. 


February 2022 Marieke Huisman 
ETAPS SC Chair 
ETAPS e.V. President 


Preface 


This volume contains the papers presented at FASE 2022, the 25th International 
Conference on Fundamental Approaches to Software Engineering. FASE 2022 was 
organized as part of the annual European Joint Conferences on Theory and Practice of 
Software (ETAPS 2022). 

FASE is concerned with the foundations on which software engineering is built, 
including topics like software engineering as an engineering discipline, requirements 
engineering, software architectures, software quality, model-driven development, 
software processes, software evolution, AI-based software engineering, and the 
specification, design, and implementation of particular classes of systems, such as 
(self-)adaptive, collaborative, AI, embedded, distributed, mobile, pervasive, cyber-physical, 
or service-oriented applications. 

FASE 2022 received 61 submissions and used a double-anonymous reviewing 
process. Each submission was reviewed by three Program Committee members. After 
an online discussion period, the Program Committee accepted 17 papers as part of the 
conference program (28% acceptance rate). 

FASE 2022 hosted the 4th International Competition on Software Testing 
(Test-Comp 2022). Test-Comp is an annual comparative evaluation of testing tools. 
This edition contained 12 participating tools, from academia and industry. These 
proceedings contain the competition report and two system descriptions of participating 
tools. The system-description papers were reviewed and selected by a separate Program 
Committee: the Test-Comp jury. Each paper was assessed by at least three reviewers. 
Two sessions in the FASE program were reserved for the presentation of the results: the 
summary by the Test-Comp chair and the presentations of the participating tools by the 
developer teams in the first session, and the community meeting in the second session. 

Many people contributed to the success of FASE 2022. We are grateful to the 
Program Committee members and reviewers for their thorough reviews and constructive 
discussions. We thank the ETAPS 2022 organizers, in particular, Jan Křetínský 
and Dirk Beyer (General Chairs), Julia Eisentraut (Organization Chair), 
Maximilian Weininger (Web Chair), and Alexandros Evangelidis (Proceedings Chair). 
We thank Marieke Huisman (ETAPS Steering Committee Chair) and Tarmo Uustalu 
(ETAPS Publicity Chair) for managing the process, and Andrzej Wasowski (FASE 
Steering Committee Chair) for his feedback and support. We are especially grateful to 
our Artefact Evaluation Committee Chairs Marie-Christine Jakobs and Eduard Kam- 
burjan. Last but not least, we would like to thank the authors for their excellent work. 

Abstract. Contract-based design is a promising methodology for taming 
the complexity of developing sophisticated systems. A formal contract 
distinguishes between assumptions, which are constraints that the 
designer of a component puts on the environments in which the com- 
ponent can be used safely, and guarantees, which are promises that the 
designer asks from the team that implements the component. A theory of 
formal contracts can be formalized as an interface theory, which supports 
the composition and refinement of both assumptions and guarantees. 
Although there is a rich landscape of contract-based design methods 
that address functional and extra-functional properties, we present the 
first interface theory that is designed for ensuring system-wide security 
properties. Our framework provides a refinement relation and a compo- 
sition operation that support both incremental design and independent 
implementability. We develop our theory for both stateless and state- 
ful interfaces. We illustrate the applicability of our framework with an 
example inspired by the automotive domain. 


Keywords: Contract-based design, Interface Theory, Hyperproperties, 
Information-flow. 


1 Introduction 


The rise of pervasive information and communication technologies seen in cyber- 
physical systems, internet of things, and blockchain services has been accompa- 
nied by a tremendous growth in the size and complexity of systems [28]. Subtle 
dependencies involving multiple architectural layers and unforeseen environmen- 
tal interactions can expose these systems to cyber-attacks. This problem is fur- 
ther exacerbated by the heterogeneous nature of their constituent components, 


* This project has received funding from the European Union's Horizon 2020 research 
and innovation programme under grant agreement No 956123 and was funded in 
part by the FWF project W1255-N23 and by the ERC-2020-AdG 101020093. 
which are often developed independently by different teams or providers. In such 
a scenario, defining and enforcing security requirements across components at an 
early stage of the design process becomes a necessity. This engineering approach 
is called security-by-design. Although in recent years there has been impressive 
progress in the verification of security properties for individual system compo- 
nents, the science of compositional security design [22,23] is still in its infancy. 


Security policies are usually enforced by restricting the flow of information 
in a system [30]. Information-flow policies define which information a user or 
a software/hardware component is allowed to observe or to interfere with while 
interacting with another component. 


The goal of information-flow control is to ensure that a system as a whole 
satisfies the desired policies. It is especially challenging to verify that there are 
no side-channels or implicit flows that violate a given policy. For example, in a 
modern car, the tight coupling between the cyber and the physical components 
allows an attacker to infer computational properties, such as secrets used for 
encryption, from side-channels, such as power consumption and electromagnetic 
radiation [17]. Moreover, the increasing connectivity of automotive systems with 
their environment makes it easier for the attacker to gather data about the 
system behavior. The attacker can use this data to exploit weaknesses of the 
system implementation and gain control over the system [32,7]. These attacks 
often rely on analyzing and comparing multiple observations to deduce protected 
information. From a formal-language perspective, such security vulnerabilities 
are not characterized by properties of a single system execution, but rather by 
properties of sets of execution traces, which are called hyperproperties [12]. 


The rigorous design of systems that satisfy information flow requirements is 
essential from the security perspective. This activity can be supported by the 
verification of information flow properties, a well-studied problem with a rich 
landscape of both theory and tools, ranging from language-based [29,18,15,11] 
to simulation-based [25] approaches. Nevertheless, the existing verification solu- 
tions do not address two important aspects. First, components in complex sys- 
tems are often heterogeneous and cannot be analysed with a single verification 
tool. Moreover, it is not clear how to combine component verification outcomes 
to infer system-level information-flow properties. Second, existing methods do 
not provide guidelines on which information flow properties need to be veri- 
fied against individual components to provide system-level guarantees regarding 
leakage of information. 


In this paper, we present a contract-based design [8] approach for the struc- 
tural aspect of information-flow policies. Contract-based design provides a formal 
framework for building complex systems from individual components, mixing 
both top-down and bottom-up steps. A top-down step decomposes and refines 
system-wide requirements; a bottom-up step assembles a system by combin- 
ing available components. A formal contract distinguishes between assumptions, 
which are constraints that the designer of a component puts on the environments 
in which the component can be used safely, and guarantees, which are promises 
that the designer asks from the team that implements the component. A theory 
of formal contracts can be formalized as an interface theory, which supports the 
composition and refinement of both assumptions and guarantees [2,3,31]. While 
there is a rich landscape of interface theories for functional and extra-functional 
properties [10,4,13,20], we present the first interface theory that is designed for 
ensuring system-wide security properties, thus paving the way for a science of 
safety and security co-engineering. 


The focus on the structural aspects of information flow and abstraction from 
concrete semantics enables compositional reasoning in the presence of heterogeneous 
components and is complementary to the existing body of work on information 
flow verification. A different component implementation verified under different 
semantics could result in different flows being detected. However, after deriving 
the component flows from the implementation under some concrete semantics, 
the theory can be agnostic about the underlying semantic interpretation. Hence 
it enables the design of secure systems from trusted components by abstracting 
away how information flows and by focusing on whether it can flow at all. In 
essence, our approach makes it possible to decompose system-level information flow 
requirements and derive component properties that need to hold, thus providing 
a divide-and-conquer procedure for organizing verification tasks. 


Our theory is based on information-flow assumptions as well as information- 
flow guarantees. As an interface theory, our theory supports both incremental 
design and independent implementability [3]. Incremental design allows the com- 
position of different system parts, each coming with their own assumptions and 
guarantees, without requiring additional knowledge of the overall design con- 
text. Independent implementability enables the separate refinement of different 
system parts by different teams that, without gaining additional information 
about each other’s design choices, can still be certain that their designs, once 
combined, preserve the specified system-wide requirements. While in previous 
interface theories, the environment of a component is held responsible for meet- 
ing assumptions, and the implementation of the component for the guarantees, 
there are cases of information-flow violations for which blame cannot be assigned 
uniquely to the implementation or the environment. In information-flow inter- 
faces we therefore introduce, besides assumptions and guarantees, a new, third 
type of constraint—called properties—whose enforcement is the shared respon- 
sibility of the implementation and the environment. 


We develop our framework for both stateless and stateful interfaces. Stateless 
information-flow interfaces are built from primitive information-flow constraints— 
assumptions, guarantees, and properties—of the form “the value of a variable x 
is always independent of the value of another variable y.” Stateful information- 
flow interfaces add a temporal dimension, e.g., “the value of y is independent of 
x until the value of z is independent of x.” The temporal dimension is introduced 
through a natural notion of state and state transition for interfaces, not through 
logical operators. We prove that our calculus of information-flow interfaces sat- 
isfies the principles of incremental design and independent implementability. 

2 Application Example 


We showcase the applicability of our theory with an example from the auto- 
motive industry: a stepwise design of a shared communication infrastructure 
(a bus) from distance warners and a wheel sensor to the braking system and the 
odometer. We adapted this use-case from the industrial case study presented 
by Marcus Mikulcak et al. [25]. The main goal of this system design is to en- 
sure the integrity of a communication channel used to perform a safety-critical 
functionality. We consider two integrity levels, high and low, to characterize 
functionalities in our system. Then, we want to guarantee that data exchanged 
by high-integrity functionalities is not compromised by low-integrity functions. 

Distance warners sense the car’s proximity to other objects and send their 
analysis to other components. In our example, we have two distance warners, 
at the front and the back of the car, that use the shared bus to communicate 
with the braking system. The wheel sensor senses the wheel rotations and sends 
this information through the shared bus to the odometer. The braking system 
is a high-integrity system since it performs safety-critical functions. Hence the 
communication channel between the distance warners and the braking system is 
classified with high-integrity, while the communication between the wheel sensor 
and the odometer is low-integrity. Thus, data sent by the wheel sensor should 
not interfere with the high-integrity channel to prevent distance warnings sent 
to the braking system from being delayed or lost. The main goal of our design 
process is to guarantee that the closed system requirement that information from 
the wheel sensor does not flow to the braking system is propagated accordingly 
to subsystems through successive decomposition and refinement steps. 

Figure 1 shows the graphical representation we adopt throughout the paper for the 
objects in our theory. We represent the open system no-flow requirements with 
dashed arrows. Then, arrows to input ports are assumptions while arrows to output 
ports are guarantees. The closed system no-flows, properties, are represented as 
dotted arrows. To improve the readability of the drawings, it is implicit that for 
each drawn property, we have the same guarantee over the open system. When it is 
clear from the context we may omit port names. 

[Figure 1: legend for interfaces, components, input and output ports, and component flows.] 

Fig. 1: Representation of the objects in our theory. 

We present, in Fig. 2, the stepwise design of the security requirement that 
data from the wheel sensor, wheel_tick, should not flow to the target of the 
distance warners, distw_f_t and distw_b_t. The first interface in Fig. 2 includes two 
properties that specify this security requirement. The system is then decomposed 
into the sending subsystems (warners and wheel sensor), the shared bus, and the 
receiving subsystems. 

[Figure 2: three successive design steps, each showing the Sending, Bus, and Receiving 
interfaces connected through the ports distw_f_s, distw_b_s, and wheel_tick (into the Bus) 
and distw_f_t, distw_b_t, and odometer (out of the Bus); the last step additionally shows 
the Braking System and the Odometer.] 

Fig. 2: Top-down design of a shared communication infrastructure used by two 
distance warners, distw_f_s and distw_b_s, and a wheel sensor, wheel_tick, to 
communicate with the braking system, distw_f_t and distw_b_t, and the odometer, 
odometer, respectively. 

Naturally, we keep the two properties from the first interface as properties 
in the Bus interface. However, this natural decomposition does not define 
a well-formed interface according to our theory because the properties in the 
Bus interface cannot be satisfied given the interface's current assumption and 
guarantee. As the environment allows a flow from wheel_tick to the source of 
the front distance warner, wheel_tick $\rightsquigarrow$ distw_f_s, then, with the flow allowed 
by the guarantee from a distance warner source to its target, we have the flow 
wheel_tick $\rightsquigarrow$ distw_f_s $\rightsquigarrow$ distw_f_t. This flow is forbidden by the interface's 
properties. If we specified the no-flow properties in the Bus interface as guarantees, 
then the interface would be well-formed. However, the composition of the three 
subsystems would not satisfy the initial specification because guarantees only 
apply to implementations of their interface, and the flow described above would 
still be allowed in the composition of the three subsystems. This illustrates two 
applications of the information-flow interface theory: to detect inconsistent 
no-flow specifications and faulty decompositions. Moreover, when an interface is not 
well-formed we can provide a witness for the property violation. We can use this 
witness to guide the refinement of an ill-formed interface into a well-formed one. 

In the second step of our refinement, in Figure 2, we add the missing as- 
sumptions to the Bus interface. Our notion of composition compatibility between 
interfaces requires that the Sending interface includes guarantees that imply 
the Bus assumptions, as the Sending interface will be part of the Bus environ- 
ment. At this point, with a certified decomposition of the original specification, 
our theory guarantees that each subsystem can now be further refined indepen- 
dently (possibly by different teams). The last step illustrates an independent 
refinement of the Sending and the Receiving interfaces. 

In Fig. 3, we present the stateful view of the system, which requires that 
the system satisfies the composition of the Sending, the Bus, and the Receiving 
interfaces derived in Fig. 2 at all times. We present, as well, a refinement of 
that specification, which requires that at each time point only one of the sending 
components can use the bus. The interfaces that define each state are named 
after the sending component that can use the bus (e.g., in the state $S_{wheel}$ only 
the wheel_tick can use the bus). If the access to the bus is mutually exclusive, 
then we can simplify the assumptions on the environment in the Bus interface. 
With more guarantees on the implementations we need fewer assumptions to 
satisfy the properties. 


[Figure 3: the stateful specification and its refinement (marked "refine") into states 
named after the sending component that may use the bus, such as $S_{wheel}$ and 
$S_{distw\_f}$; each state contains interfaces over the ports distw_f_s, distw_b_s, 
wheel_tick, distw_f_t, distw_b_t, and odometer.] 

Fig. 3: Design of mutually exclusive shared communication infrastructure for 
distance warners and the wheel odometer. Each state is defined by the composition 
of the interfaces inside. 


Finally, the components of our system can be, for instance, the Simulink and 
Stateflow models provided to the authors [25] by their industrial partners. We can 
then use the tool introduced in their work to verify whether these components 
implement the stateful interfaces we derived. 

In summary, our framework defines relations on both stateless and stateful 
interfaces specifying information-flow policies that allow us to check whether: (i) a given 
interface refines (or abstracts) the current specification; (ii) two interfaces are 
compatible for composition; (iii) a specification is consistent; (iv) information- 
flows in a component define an implementation of a given interface; and (v) a 
system decomposition refines the system specification. 


3 Stateless Information-flow Interfaces 


In this section, we introduce a stateless interface and component algebra for 
secure information flow. Information flows between two variables when the value 
of one influences the other. 
We are interested in the structural properties of information flow within a 
system and define relations abstracting flows, flow relations, as being both reflex- 
ive and transitively closed. An information-flow component abstracts the imple- 
mentation of a system by a flow relation. An information-flow interface specifies 
forbidden flows in an open system by defining three kinds of constraints: as- 
sumptions, guarantees, and properties. The assumption characterizes flows that 
we assume are not part of the environment while the guarantee describes all 
flows the system forbids and that are local to it. The property qualifies the for- 
bidden flows at the interaction between the system and its environment. Hence, 
it represents a requirement on the closed system that needs to be enforced by 
guarantees on the open system and assumptions on its environment. 


Definition 1. Let $X$ and $Y$ be disjoint sets of input and output variables, 
respectively, with $Z = X \cup Y$ the set of all variables. A stateless 
information-flow component is a tuple $(X, Y, \mathcal{M})$, where $\mathcal{M} \subseteq Z \times Y$ is a 
(reflexive and transitive) flow relation, called flows. A stateless information-flow 
interface is a tuple $(X, Y, A, G, P)$, where: $A \subseteq Z \times X$ is a relation, called 
assumption; $G \subseteq Z \times Y$ is a relation, called guarantee; and $P \subseteq Z \times Y$ is a 
relation, called property. 


Given an interface F we are interested in components that do not implement 
flows forbidden by either the interface guarantees (called implementations of F) 
or the interface assumptions (called environments of F). 


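Definition 2. A component $f_{\mathcal{E}} = (Y, X, \mathcal{E})$ is called an environment of 
$F = (X, Y, A, G, P)$. An environment is admissible for $F$, denoted by $f_{\mathcal{E}} \models F$, iff 
$\mathcal{E} \subseteq \overline{A}$. A component $f = (X, Y, \mathcal{M})$ implements the interface $F$, denoted by 
$f \models F$, iff $\mathcal{M} \subseteq \overline{G}$. Here $\overline{A}$ and $\overline{G}$ denote the complements of $A$ and $G$, 
that is, the flows they do not forbid. 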
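
The following Python sketch mirrors Definitions 1 and 2 under simplifying assumptions: relations are finite sets of pairs, and both conditions are read as "no flow of the component is forbidden". The names `Component`, `Interface`, `implements`, and `admissible`, as well as the `leaky` component, are illustrative and do not come from the formal development.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

Pair = Tuple[str, str]  # (source variable, target variable)


@dataclass(frozen=True)
class Component:
    inputs: FrozenSet[str]
    outputs: FrozenSet[str]
    flows: FrozenSet[Pair]  # flows of the component; targets are outputs


@dataclass(frozen=True)
class Interface:
    inputs: FrozenSet[str]
    outputs: FrozenSet[str]
    assumption: FrozenSet[Pair]  # forbidden flows into inputs
    guarantee: FrozenSet[Pair]   # forbidden flows into outputs
    prop: FrozenSet[Pair]        # forbidden flows of the closed system


def implements(f: Component, F: Interface) -> bool:
    """f implements F: same ports, and no flow of f is forbidden by the guarantee."""
    same_ports = f.inputs == F.inputs and f.outputs == F.outputs
    return same_ports and not (f.flows & F.guarantee)


def admissible(env: Component, F: Interface) -> bool:
    """env is admissible for F: swapped ports, and no flow is forbidden by the assumption."""
    swapped_ports = env.inputs == F.outputs and env.outputs == F.inputs
    return swapped_ports and not (env.flows & F.assumption)


# Hypothetical instance shaped after the Bus example of Section 2.
X = frozenset({"distw_f_s", "distw_b_s", "wheel_tick"})
Y = frozenset({"distw_f_t", "distw_b_t", "odometer"})
no_flows = frozenset({("wheel_tick", "distw_f_t"), ("wheel_tick", "distw_b_t")})
Bus = Interface(X, Y, frozenset(), no_flows, no_flows)

bus = Component(X, Y, frozenset({("distw_f_s", "distw_f_t")} | {(y, y) for y in Y}))
leaky = Component(X, Y, frozenset({("wheel_tick", "distw_f_t")} | {(y, y) for y in Y}))
sending = Component(Y, X, frozenset({("wheel_tick", "distw_f_s")} | {(x, x) for x in X}))
```

With this encoding, `implements(bus, Bus)` and `admissible(sending, Bus)` hold, while `implements(leaky, Bus)` fails because `leaky` contains a flow that the guarantee forbids.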


Example 1. In Figure 4, we have the first refinement of the interface Bus from our 
application example. The Bus interface specifies the requirement on the closed 
system (using properties) that there are no-flows from wheel_tick to both distw_f_s 
and distw_b_s. The Bus interface specifies this requirement as a guarantee on the 
open system, too. Then, the bus component is an implementation of Bus because it 
has only a flow from distw_f_s to distw_f_t, which is not in the guarantees of the 
Bus interface. Bus does not have any assumptions, so the sending component is an 
admissible environment for Bus. 

When we compose the components sending and bus, there is a flow from wheel_tick 
to distw_f_t, which is in the properties of the Bus. Hence the assumption and 
guarantee specified over the open system are not enough to ensure the property 
over the closed system. The composition of these two components witnesses that 
the Bus interface is not well-formed. 

[Figure 4: the interface Bus, the component bus, and the component sending, over the 
ports distw_f_s, distw_b_s, wheel_tick, distw_f_t, distw_b_t, and odometer.] 

Fig. 4: Interface Bus with an implementation, bus, and an admissible environment, 
sending. 
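
The witness in Example 1 can be recomputed with a few lines of Python. This is only a sketch: it assumes that sending contains the flow from wheel_tick to distw_f_s discussed in Section 2, and that the Bus properties are the no-flows from wheel_tick to distw_f_t and distw_b_t stated there. The helper `transitive_closure` is an illustrative name.

```python
def transitive_closure(pairs):
    """Smallest transitive relation containing `pairs`."""
    rel = set(pairs)
    while True:
        new = {(a, d) for (a, b) in rel for (c, d) in rel if b == c} - rel
        if not new:
            return rel
        rel |= new


# Reflexive pairs are omitted: they cannot violate an irreflexive property.
bus_flows = {("distw_f_s", "distw_f_t")}
sending_flows = {("wheel_tick", "distw_f_s")}  # assumed from Section 2

# Assumption: the Bus properties are the no-flows of Section 2.
bus_property = {("wheel_tick", "distw_f_t"), ("wheel_tick", "distw_b_t")}

closed_system = transitive_closure(bus_flows | sending_flows)
witnesses = closed_system & bus_property
print(witnesses)  # {('wheel_tick', 'distw_f_t')}
```

The single pair in `witnesses` is the flow that shows the first Bus interface is not well-formed.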
An information-flow interface is well-formed when it has at least one implementation 
and one admissible environment. Therefore, all of its relations must 
be irreflexive. We refer to irreflexive relations as no-flow relations. A well-formed 
interface ensures, additionally, that an interface property is consistent with its 
assumptions and guarantees. An interface property is not consistent when the 
flow relation defined by the composition of one of the interface’s admissible 
environments with one of its implementations includes a pair specified in the 
interface property. To check whether the property is consistent, we compute the 
flow relation of the closed system defined by an interface F, which includes all 
flows that are in the composition of any of the interface’s admissible environ- 
ments with one of its implementations. The main challenge is that, in general, 
the complement of an interface’s guarantee (assumption) may not define the flow 
relation of any of its implementations (environments). Hence there may be no 
maximal implementation or admissible environment for a given interface. 


Example 2. In Figure 5, we have two components, $bus_1$ and $bus_2$, that 
implement the interface Bus from the previous example. A maximal implementation 
of Bus must include the flows in both $bus_1$ and $bus_2$. As flows are transitively 
closed, the maximal implementation would include a flow from wheel_tick to 
distw_f_t, which violates the Bus guarantees. 

[Figure 5: the two components $bus_1$ and $bus_2$.] 

Fig. 5: Bus implementations. 
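
The effect described in Example 2 can be reproduced with a small Python sketch. The concrete flows of the two components are hypothetical, since they are only drawn in the figure: here one component lets wheel_tick flow to odometer and the other lets odometer flow to distw_f_t. The helper names are illustrative.

```python
def transitive_closure(pairs):
    """Smallest transitive relation containing `pairs`."""
    rel = set(pairs)
    while True:
        new = {(a, d) for (a, b) in rel for (c, d) in rel if b == c} - rel
        if not new:
            return rel
        rel |= new


def respects(flows, no_flows):
    """True when the transitively closed flows avoid every forbidden pair."""
    return not (transitive_closure(flows) & no_flows)


guarantee = {("wheel_tick", "distw_f_t"), ("wheel_tick", "distw_b_t")}

# Hypothetical flows standing in for the two components of Fig. 5.
bus_1 = {("wheel_tick", "odometer")}
bus_2 = {("odometer", "distw_f_t")}

each_ok = respects(bus_1, guarantee) and respects(bus_2, guarantee)
union_ok = respects(bus_1 | bus_2, guarantee)
```

Each component respects the guarantee on its own, yet their union does not once it is transitively closed, which is why no maximal implementation exists in this setting.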


Since we do not have maximal implementations and maximal admissible 
environments, we cannot characterize all flows of the closed system 
defined by an interface $F$ by computing the transitive closure of all pairs in the 
complement of $F$'s assumption and guarantee, $(\overline{A} \cup \overline{G})^*$. This approach would 
yield more flows than the flows of the closed system defined by $F$. Instead, we 
consider all pairs $(z, z')$ such that there exists a path from $z$ to $z'$ that alternates 
between flows in the complement of the assumption, $\overline{A}$, and flows in the complement 
of the guarantee, $\overline{G}$. We define this notion below as the composition between 
no-flow relations. In Proposition 1 we prove that this definition captures our 
intended relation between an interface property and its environments and 
implementations. 


Definition 3. A no-flow relation $N \subseteq (A \cup B) \times B$ is an irreflexive relation, 
and its complement is $\overline{N} = ((A \cup B) \times B) \setminus N$. Let $N \subseteq (A \cup B) \times B$ and 
$N' \subseteq (A' \cup B') \times B'$ be two no-flow relations. The set of flows defined by their 
composition is $N \bullet N' = (Id_{A' \cup B'} \cup \overline{N}) \circ (\overline{N'} \circ \overline{N})^* \circ (Id_{B} \cup \overline{N'})$, 
where $Id_Z = \{(z, z) \mid z \in Z\}$ and 
$R \circ R' = \{(z, z'') \mid (z, z') \in R \text{ and } (z', z'') \in R'\}$ is the usual 
composition between relations. 

We now have all the ingredients to define well-formed interfaces. 


Definition 4. An interface $(X, Y, A, G, P)$ is well-formed iff $A$, $G$ and $P$ are 
no-flow relations; and the property is consistent, i.e. $(A \bullet G) \cap P = \emptyset$. 
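
As a hedged illustration of Definitions 3 and 4, the sketch below computes the alternating paths described in the text with a plain graph search rather than with relational algebra, and then checks consistency of the property. The function names and the strengthened relations `A2` and `G2` are assumptions made for this example; in particular, `A2` and `G2` are not the assumptions and guarantees drawn in Fig. 2.

```python
from itertools import product


def complement(no_flow, sources, targets):
    """Pairs of sources x targets that the no-flow relation does not forbid."""
    return set(product(sources, targets)) - set(no_flow)


def alternating_flows(first, second):
    """Pairs (z, z') joined by a non-empty path whose consecutive steps
    come from different relations."""
    relations = (first, second)
    reachable = set()
    for start in {a for rel in relations for (a, _) in rel}:
        # A search state records the current variable and which relation
        # supplied the step that reached it.
        frontier = [(b, i) for i in (0, 1) for (a, b) in relations[i] if a == start]
        seen = set(frontier)
        while frontier:
            node, last = frontier.pop()
            reachable.add((start, node))
            for (a, b) in relations[1 - last]:
                if a == node and (b, 1 - last) not in seen:
                    seen.add((b, 1 - last))
                    frontier.append((b, 1 - last))
    return reachable


def is_well_formed(X, Y, A, G, P):
    """Definition 4, reading the composition as alternating paths."""
    Z = X | Y
    irreflexive = all(a != b for (a, b) in A | G | P)
    closed_system = alternating_flows(complement(A, Z, X), complement(G, Z, Y))
    return irreflexive and not (closed_system & P)


X = {"distw_f_s", "distw_b_s", "wheel_tick"}
Y = {"distw_f_t", "distw_b_t", "odometer"}
P = {("wheel_tick", "distw_f_t"), ("wheel_tick", "distw_b_t")}

# First Bus interface of Section 2: no assumptions, guarantee equal to the property.
first_attempt = is_well_formed(X, Y, set(), set(P), P)

# Hypothetical strengthening: each warner source may only depend on itself,
# and each warner target only on itself and its own source.
source_of = {"distw_f_t": "distw_f_s", "distw_b_t": "distw_b_s"}
A2 = {(z, x) for z in X | Y for x in source_of.values() if z != x}
G2 = {(z, y) for z in X | Y for y in source_of if z not in (y, source_of[y])}
strengthened = is_well_formed(X, Y, A2, G2, P)
```

Here `first_attempt` is false because of the path from wheel_tick through distw_f_s to distw_f_t, while the hypothetical strengthening blocks every alternating path from wheel_tick to a warner target.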


Proposition 1. For all well-formed interfaces $F = (X, Y, A, G, P)$, and for all 
components $f = (X, Y, \mathcal{M})$ and $f_{\mathcal{E}} = (Y, X, \mathcal{E})$: if $f$ implements $F$, $f \models F$, and 
$f_{\mathcal{E}}$ is an admissible environment of $F$, $f_{\mathcal{E}} \models F$, then their combined flows are 
consistent with the property of $F$, $(\mathcal{M} \cup \mathcal{E})^* \cap P = \emptyset$. 


3.1 Composition and Incremental Design 


We now present how to compose components and interfaces. We introduce a 
compatibility predicate that checks whether the composition of two interfaces is 
a well-formed interface. We prove that these two notions support the incremental 
design of systems. 


The different types of variables between interfaces $F$ and $F'$ are defined as
$Y_{F,F'} = Y \cup Y'$, $X_{F,F'} = (X \cup X') \setminus Y_{F,F'}$, and $Z_{F,F'} = Y_{F,F'} \cup X_{F,F'}$. The
same definition applies to components $f$ and $f'$. The composition of components
$f$ and $f'$ is the reflexive and transitive closure of the union of the individual
component flows, i.e. $f \otimes f' = (X_{f,f'}, Y_{f,f'}, (\mathcal{M} \cup \mathcal{M}')^*)$. We present interface
composition by defining separately $\mathcal{A}$, $\mathcal{G}$ and $\mathcal{P}$ of the composed interface.

We compose interfaces through their shared variables. Shared variables be- 
tween two interfaces are all variables that are an input variable in one of the 
interfaces and an output variable in the other one. The composite flows between
two interfaces are the set of all flows that are in the composition of any of their
implementations. As for the definition of flows in the closed system defined by
an interface, the composite flows are the composition of the guarantees of the 
interfaces being composed (as defined in Definition 3). The composition of two 
interfaces should not restrict their sets of implementations, thus the composite 
guarantees are the complement of the composite flows. 


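Definition 5. Let $F = (X, Y, \mathcal{A}, \mathcal{G}, \mathcal{P})$ and $F' = (X', Y', \mathcal{A}', \mathcal{G}', \mathcal{P}')$ be two
interfaces. Their composite flows are $\overline{\mathcal{G}}_{F,F'} = \mathcal{G} \bullet \mathcal{G}'$. The composite guarantees
of $F$ and $F'$ are defined as $\mathcal{G}_{F,F'} = (Z_{F,F'} \times Y_{F,F'}) \setminus \overline{\mathcal{G}}_{F,F'}$, also denoted by $\mathcal{G}_{F \otimes F'}$.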


The assumption of an interface resulting from the composition of multiple 
interfaces is the weakest condition on the environment that allows the interfaces 
being composed to work together. Additionally, it must support incremental 
design, i.e. the admissibility of an environment must be independent of the order 
in which the interfaces are composed. 


Naturally, all assumptions of each interface must be considered during com- 
position. However, not all of them can be kept as assumptions of the composite 
interface, because shared variables will be output variables of the composition. If 
the environment can still influence the information flow to a shared variable, then 
we may need to add assumptions to prevent such a flow. Propagated assumptions 
between two interfaces are derived by looking in their respective assumptions for 
no-flow pairs pointing to a shared variable. 
Example 3. In Figure 6, we depict an interface specifying information-flow policies
for a car immobilizer, $F_{imm}$, along with an interface for a Controller Area
Network (CAN bus), $F_{can}$. Interface $F_{imm}$ has only one assumption: that key does
not flow to can. In this design, the immobilizer uses the CAN to communicate with the
car electronic control unit (ECU). Our goal is to compose both interfaces.
These interfaces share the port can. Thus, can will be an output port of their
composition. The interface $F_{can}$ cannot guarantee that the only assumption in
$F_{imm}$ is satisfied after composition because it does not have a port key. As we
are working with open systems and assume that the environment is helpful, we
can add further assumptions to ensure the correctness of this composition. For
example, we can add assumptions that prevent key from flowing to an input port
in $F_{can}$ that can flow to can. Such flows could be part of a flow from key to can,
which would violate the assumption we want to enforce. In this case, we note
that in $F_{can}$ information in ecu can flow to can. So, the composite interface needs
to include the assumption that key does not flow to ecu. This is a propagated
assumption.

Fig. 6: Propagating assumptions.


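Definition 6. The set of assumptions propagated from $F = (X, Y, \mathcal{A}, \mathcal{G}, \mathcal{P})$
to $F' = (X', Y', \mathcal{A}', \mathcal{G}', \mathcal{P}')$ is
$\hat{\mathcal{A}}_{F \to F'} = \{(z, z') \mid \exists s \in X \cap Y' \text{ s.t. } (z, s) \in \mathcal{A} \text{ and } (z', s) \in \overline{\mathcal{G}}_{F,F'}\}$.
The set with all propagated assumptions of $F$ and $F'$ is
$\hat{\mathcal{A}}_{F,F'} = \hat{\mathcal{A}}_{F \to F'} \cup \hat{\mathcal{A}}_{F' \to F}$. The composite assumptions of $F$ and $F'$ are defined
as $\mathcal{A}_{F,F'} = (\mathcal{A} \cup \mathcal{A}' \cup \hat{\mathcal{A}}_{F,F'}) \cap (Z_{F,F'} \times X_{F,F'})$, also denoted by $\mathcal{A}_{F \otimes F'}$.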


Example 4. From the example before, information from the ports ecu, imm and
deb can all flow to can. So, they are flows in the composite interface and, by
Definition 5, $\{(ecu, can), (imm, can), (deb, can)\} \subseteq \overline{\mathcal{G}}_{F_{imm},F_{can}}$. Then,
$\hat{\mathcal{A}}_{F_{imm} \to F_{can}} = \{(key, ecu), (key, imm), (key, deb)\}$. From those assumptions only $(key, ecu)$ points
to a variable in $X_{F_{imm},F_{can}}$, so $\mathcal{A}_{F_{imm},F_{can}} = \{(key, ecu)\}$.
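
The computation in Example 4 can be replayed with a small Python sketch of Definition 6. The function names are hypothetical, and the port typing used below (can is an input of the immobilizer and an output of the CAN bus; key and ecu are the only inputs of the composition) is an assumption read off the surrounding text, since the figure itself is not reproduced here. Only the three composite flows named in Example 4 are supplied.

```python
def propagated(assumptions, inputs, other_outputs, composite_flows):
    """Definition 6, one direction: pairs (z, z2) such that, for some shared
    variable s, (z, s) is an assumed no-flow and (z2, s) is a composite flow."""
    shared = set(inputs) & set(other_outputs)
    return {(z, z2)
            for (z, s) in assumptions if s in shared
            for (z2, s2) in composite_flows if s2 == s}


def composite_assumptions(a, a2, propagated_pairs, composite_inputs):
    """Keep only no-flows that point to an input of the composition."""
    candidates = set(a) | set(a2) | set(propagated_pairs)
    return {(z, x) for (z, x) in candidates if x in composite_inputs}


# Assumed data for the immobilizer / CAN bus example.
a_imm = {("key", "can")}
x_imm = {"key", "can"}          # assumption: inputs of the immobilizer
y_can = {"can", "deb"}          # assumption: outputs of the CAN bus
flows = {("ecu", "can"), ("imm", "can"), ("deb", "can")}

prop = propagated(a_imm, x_imm, y_can, flows)
composite = composite_assumptions(a_imm, set(), prop, {"key", "ecu"})
```

Under these assumptions, `prop` contains the three pairs listed in Example 4 and `composite` keeps only `("key", "ecu")`, because can, imm and deb are outputs of the composition.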


The properties of the composition contain all properties of each interface
being composed. They include, additionally, all derived properties from the
assumptions and guarantees of the composite. Derived properties are guarantees
that hold under any admissible environment. They are defined by all pairs $(z, y)$
in an interface guarantee s.t. there is no combination of flows allowed by its
assumptions and guarantees that creates a flow from $z$ to $y$. Then, the derived
properties of an assumption $\mathcal{A}$ and guarantee $\mathcal{G}$ are defined as
$\mathcal{P}^{\mathcal{A},\mathcal{G}} = \mathcal{G} \setminus (\mathcal{A} \bullet \mathcal{G})$.
The composite properties of $F$ and $F'$ are
$\mathcal{P}_{F,F'} = \mathcal{P} \cup \mathcal{P}' \cup \mathcal{P}^{\mathcal{A}_{F,F'},\mathcal{G}_{F,F'}}$.


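Definition 7. The composition of two interfaces $F$ and $F'$ is the interface
$F \otimes F' = (X_{F,F'}, Y_{F,F'}, \mathcal{A}_{F,F'}, \mathcal{G}_{F,F'}, \mathcal{P}_{F,F'})$, where $\mathcal{A}_{F,F'}$ is defined in Definition 6,
$\mathcal{G}_{F,F'}$ in Definition 5, and $\mathcal{P}_{F,F'}$ in the previous paragraph.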


We allow composition for any two arbitrary interfaces. However, not all
compositions result in a well-formed interface. We define next the notions of two
interfaces being composable and compatible. Composability imposes the syntactic
restriction that the output variables of both interfaces are disjoint. Compatibility
captures the semantic requirement that whenever an interface $F$ provides
inputs to another interface $F'$, then $F'$ needs to include guarantees that imply the
assumptions of $F$.


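Definition 8. Two interfaces $F = (X, Y, \mathcal{A}, \mathcal{G}, \mathcal{P})$ and $F' = (X', Y', \mathcal{A}', \mathcal{G}', \mathcal{P}')$
are composable iff $Y \cap Y' = \emptyset$. The interfaces $F$ and $F'$ are compatible, denoted
$F \sim F'$, iff they are composable and $((\mathcal{A} \cup \mathcal{A}') \cap (Z_{F,F'} \times Y_{F,F'})) \subseteq \mathcal{G}_{F,F'}$.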
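
The composite guarantees of Definition 5 and the compatibility check of Definition 8 can be sketched as follows. This is an illustrative sketch only: interfaces are represented as 5-tuples `(x, y, a, g, p)` of Python sets, all function names are hypothetical, and, as an assumption of this sketch, identity relations range over all variables of the composition.

```python
from itertools import product


def complement(no_flow, sources, targets):
    return set(product(sources, targets)) - set(no_flow)


def compose(r, s):
    return {(a, d) for (a, b) in r for (c, d) in s if b == c}


def star(r, variables):
    closure = {(v, v) for v in variables} | set(r)
    while True:
        bigger = closure | compose(closure, closure)
        if bigger == closure:
            return closure
        closure = bigger


def bullet(flows_n, flows_n2, variables):
    # Alternating composition; identities over all variables (assumption).
    ident = {(v, v) for v in variables}
    middle = star(compose(flows_n2, flows_n), variables)
    return compose(compose(ident | flows_n, middle), ident | flows_n2)


def composite_guarantees(x, y, g, x2, y2, g2):
    """Pairs into composite outputs that are not composite flows."""
    z, z2 = set(x) | set(y), set(x2) | set(y2)
    all_vars, outputs = z | z2, set(y) | set(y2)
    flows = bullet(complement(g, z, y), complement(g2, z2, y2), all_vars)
    return set(product(all_vars, outputs)) - flows


def compatible(f, f2):
    (x, y, a, g, _p), (x2, y2, a2, g2, _p2) = f, f2
    if set(y) & set(y2):          # not composable: shared output variables
        return False
    outputs = set(y) | set(y2)
    # Assumptions that now point to an output of the composition.
    pending = {(u, v) for (u, v) in set(a) | set(a2) if v in outputs}
    return pending <= composite_guarantees(x, y, g, x2, y2, g2)
```

In this sketch, an interface that assumes "a does not flow to b" is compatible with a producer of `b` only if that producer guarantees the same no-flow.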


Clearly, both the composition operator and the compatibility relation are 
commutative. Additionally, we prove that composition preserves well-formedness 
and that it supports incremental design of systems. The full proofs are in the 
appendix. 


Theorem 1. Let $F$ and $F'$ be well-formed interfaces. If the interfaces are
compatible, $F \sim F'$, then their composition, $F \otimes F'$, defines a well-formed interface.


Theorem 2 (Incremental design). Let $F$, $F'$ and $F''$ be interfaces. If $F \sim F'$
and $(F \otimes F') \sim F''$, then $F' \sim F''$ and $F \sim (F' \otimes F'')$.


Proof. We proved first that composite assumptions are associative. We assume
that $F \sim F'$ and $(F \otimes F') \sim F''$. The most interesting case is when $(z, s)$ is
an assumption of $F$ and $s$ is a shared variable between $F$ and $F' \otimes F''$. Then,
we need to prove that $(z, s) \in \mathcal{G}_{F,F' \otimes F''}$. We prove this by assuming towards a
contradiction that $(z, s) \in \overline{\mathcal{G}}_{F,F' \otimes F''}$. We illustrate it in Figure 7.
By composite flows being associative, $(z, s) \in \overline{\mathcal{G}}_{F \otimes F',F''}$.
By $(z, s)$ being an assumption of $F$ and $(s', s) \in \overline{\mathcal{G}}_{F,F'}$, we have the
derived assumption $(z, s') \in \hat{\mathcal{A}}_{F,F'}$ and, so,
$(z, s') \in \mathcal{A}_{F \otimes F'}$. Moreover, $(z, s') \in \overline{\mathcal{G}}_{F \otimes F',F''}$,
because $z$ can flow to $s'$ when $F \otimes F'$ is composed
with $F''$. This contradicts our initial assumption
that $(F \otimes F') \sim F''$.

Fig. 7: Incremental design.


We prove additionally that composition is associative for compatible inter- 
faces. 


Theorem 3. If $F \sim F'$ and $F \otimes F' \sim F''$, then $(F \otimes F') \otimes F'' = F \otimes (F' \otimes F'')$.


Finally, we show that flows resulting from the composition of any components 
that implement two given interfaces are allowed by the composition of these 
interfaces. 


Proposition 2. For any two interfaces $F$ and $F'$, and any two components
$f = (X, Y, \mathcal{M})$ and $f' = (X', Y', \mathcal{M}')$ that implement them, $f \models F$ and $f' \models F'$,
the composition of the components implements the composition of the interfaces,
$f \otimes f' \models F \otimes F'$.
3.2 Refinement and Independent Implementability


We now define a refinement relation between interfaces. Intuitively, an interface
$F'$ refines $F$ iff $F'$ admits more environments than $F$, while possibly constraining
its implementations.


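Definition 9. Interface $F' = (X', Y', \mathcal{A}', \mathcal{G}', \mathcal{P}')$ refines $F = (X, Y, \mathcal{A}, \mathcal{G}, \mathcal{P})$,
written $F' \preceq F$, when $\mathcal{A}' \subseteq \mathcal{A}$, $\mathcal{G} \subseteq \mathcal{G}'$ and $\mathcal{P} \subseteq \mathcal{P}'$.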


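Let $F$ and $F'$ be interfaces s.t. $F' \preceq F$. Let $f = (X, Y, \mathcal{M})$ and $f_{\mathcal{E}} = (Y, X, \mathcal{E})$
be components. Then, (a) if $f \models F'$, then $f \models F$; and (b) if $f_{\mathcal{E}} \models F$, then
$f_{\mathcal{E}} \models F'$.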
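
The three inclusions of Definition 9 translate directly into a set-based check. The sketch below uses a hypothetical function name and represents an interface as a 5-tuple `(x, y, a, g, p)` of Python sets; it does not check that the two interfaces range over the same variables.

```python
def refines(refined, abstract):
    """Definition 9: fewer assumptions, more guarantees, more properties."""
    _, _, a_r, g_r, p_r = refined
    _, _, a, g, p = abstract
    return set(a_r) <= set(a) and set(g) <= set(g_r) and set(p) <= set(p_r)


abstract = ({"x"}, {"y"}, {("y", "x")}, set(), set())
refined = ({"x"}, {"y"}, set(), {("x", "y")}, set())
```

Here `refined` drops the assumption on the environment (so it admits more environments) and adds a guarantee (so it constrains implementations), which is the direction described in the text.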

Additionally, we show below that refinement and composition support
independent implementability.


Theorem 4 (Independent implementability). For all well-formed interfaces
$F_1'$, $F_1$ and $F_2$, if $F_1' \preceq F_1$ and $F_1 \sim F_2$, then $F_1' \sim F_2$ and $F_1' \otimes F_2 \preceq F_1 \otimes F_2$.


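Proof. The challenging part is to prove that the refined composite contains all
properties of the abstracted one, i.e. $\mathcal{P}_{F_1 \otimes F_2} \subseteq \mathcal{P}_{F_1' \otimes F_2}$. We prove by induction
on $n \in \mathbb{N}$ that if a pair of variables $(z, y)$ cannot be defined by assume-guarantee
paths of size at most $n$ of the abstract composition, then it cannot be defined by
assume-guarantee paths of size at most $n$ of the refined composition. The base case
is easy to see. If for all $(z, s) \in \overline{\mathcal{A}}_{F_1,F_2}$ s.t. there exists $(s, y) \in \overline{\mathcal{G}}_{F_1,F_2}$,
then, by $F_1' \preceq F_1$, it follows that for all $(z, s) \in \overline{\mathcal{A}}_{F_1',F_2}$ there exists $(s, y) \in \overline{\mathcal{G}}_{F_1',F_2}$.
Hence if $(z, y) \notin \overline{\mathcal{A}}_{F_1,F_2} \circ \overline{\mathcal{G}}_{F_1,F_2}$, then $(z, y) \notin \overline{\mathcal{A}}_{F_1',F_2} \circ \overline{\mathcal{G}}_{F_1',F_2}$ as well.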


3.3 Discussion 


Properties. In this work, we consider transitively closed flows. In this setting, in 
an open system, information can flow from z to z' by flowing from z to s through
the environment, and then from s to z’ through one of its implementations. As 
our algebra focuses on the design of structural requirements of no-flows in open 
systems, it needs to support the specification of global no-flow requirements. We 
made them explicit by introducing properties. If we did not include properties 
in our interfaces, then either assumptions or guarantees would need to take over 
the role of specifying global no-flows. Let’s assume that, alternatively, guarantees 
would be interpreted as global no-flows. Then, to support incremental design, 
the compatibility criteria between interfaces would turn out to be overly restric- 
tive, with intuitive and correct designs being considered incompatible. This led 
us to the distinction between guarantees and properties, where properties may 
be supported by assumptions on the environment that can restrict the set of 
compatible interfaces. In other words, the main advantage of having properties 
is that the designer can choose how to split the responsibilities between the 
environment and the implementations to satisfy a global no-flow. 
Semantics. The structural approach that abstracts away semantic considerations
is an important feature of our theory. The practicability of our approach
lies in the support for the design of such requirements by decoupling the de- 
sign process from (its orthogonal) semantic considerations. Hence, our approach 
does not deny semantics, but rather separates the design of specifications from 
component implementation concerns. The presented approach even allows using 
tailored semantics and tools for different parts of the design. For example, at 
the bottom (component) level, no-flows and flows relations can be instantiated 
with different semantic interpretations. After deriving the component no-flows 
from the implementation under a concrete semantics, the theory can be agnostic 
about the underlying semantic interpretation and can focus on whether there 
exists a flow at all. 


4 Stateful Information-Flow Interfaces 


We extend our theory with stateful components and interfaces. These are tran- 
sition systems in which each state is a stateless component or interface, respec- 
tively. 


Definition 10. Let $X$ and $Y$ be disjoint sets of input and output variables,
respectively, with $Z = X \cup Y$ the set of all variables. Let $Q$ be a set of states with
$\hat{q} \in Q$ being the initial state and $\delta : Q \to 2^Q$ be a transition relation. A stateful
information-flow component $f$ is a tuple $(X, Y, Q, \hat{q}, \delta, \mathcal{M})$, where $\mathcal{M} : Q \to 2^{Z \times Y}$
is a state labeling such that for all states $q \in Q$, $\mathcal{M}(q)$ defines a flow relation. We
denote by $f(q) = (X, Y, \mathcal{M}(q))$ the stateless component implied by the labeling of
$q$. A stateful information-flow interface $F$ is a tuple $(X, Y, Q, \hat{q}, \delta, \mathcal{A}, \mathcal{G}, \mathcal{P})$, where
$\mathcal{A} : Q \to 2^{Z \times X}$ is called assumption; $\mathcal{G} : Q \to 2^{Z \times Y}$ is called guarantee;
and $\mathcal{P} : Q \to 2^{Z \times Y}$ is called property. For each state $q \in Q$ we denote by
$F(q) = (X, Y, \mathcal{A}(q), \mathcal{G}(q), \mathcal{P}(q))$ the stateless interface implied by the assumption,
guarantee and property of $q$.


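A stateful interface $F$ is well-formed iff $F(\hat{q})$ is a well-formed stateless
interface, and for all $q \in Q$ reachable from $\hat{q}$ the stateless interface $F(q)$ is well-formed.
In what follows, $F = (X, Y, Q, \hat{q}, \delta, \mathcal{A}, \mathcal{G}, \mathcal{P})$ and $F' = (X', Y', Q', \hat{q}', \delta', \mathcal{A}', \mathcal{G}', \mathcal{P}')$
are stateful interfaces, and $f = (X, Y, Q_f, \hat{q}_f, \delta_f, \mathcal{M})$ and
$f_{\mathcal{E}} = (Y, X, Q_{\mathcal{E}}, \hat{q}_{\mathcal{E}}, \delta_{\mathcal{E}}, \mathcal{E})$
are stateful components.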

A stateful component f implements a stateful interface F if there exists a 
simulation relation from f to F such that the stateless components in the relation 
implement the stateless interfaces they are related to. Admissible environments 
require a simulation relation from them to the interface they are admissible on. 




Definition 11. A component $f$ implements the interface $F$, denoted by $f \models F$, iff
there exists $H \subseteq Q_f \times Q$ s.t. $(\hat{q}_f, \hat{q}) \in H$ and for all $(q_f, q) \in H$: (i) $f(q_f) \models F(q)$;
and (ii) if $q_f' \in \delta_f(q_f)$, then there exists a state $q' \in \delta(q)$ s.t. $(q_f', q') \in H$.
A component $f_{\mathcal{E}}$ is an admissible environment for the interface $F$, denoted by
$f_{\mathcal{E}} \models F$, iff there exists a relation $H \subseteq Q \times Q_{\mathcal{E}}$ s.t. $(\hat{q}, \hat{q}_{\mathcal{E}}) \in H$ and for all
$(q, q_{\mathcal{E}}) \in H$: (i) $f_{\mathcal{E}}(q_{\mathcal{E}}) \models F(q)$; and (ii) if $q' \in \delta(q)$, then there exists a state
$q_{\mathcal{E}}' \in \delta_{\mathcal{E}}(q_{\mathcal{E}})$ s.t. $(q', q_{\mathcal{E}}') \in H$.
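
The implementation half of Definition 11 asks for a simulation relation, whose existence can be decided on finite systems by the standard greatest-fixpoint pruning. The sketch below is one way to do this and is not taken from the paper: transition relations are dictionaries from a state to its set of successors, the function name is hypothetical, and the stateless check of condition (i) is passed in as the hypothetical callback `local_ok`, since its definition lies outside this section.

```python
def implements_stateful(delta_f, init_f, delta, init, local_ok):
    """Return True iff some relation H as in Definition 11 exists.

    delta_f, delta: dict mapping each state to its set of successors.
    local_ok(q_f, q): stateless check for condition (i), supplied by the caller.
    """
    # Start from every pair satisfying (i), then remove pairs violating (ii).
    h = {(qf, q) for qf in delta_f for q in delta if local_ok(qf, q)}
    changed = True
    while changed:
        changed = False
        for (qf, q) in list(h):
            # (ii): every component move must be matched by an interface move.
            matched = all(any((nf, n) in h for n in delta[q])
                          for nf in delta_f[qf])
            if not matched:
                h.discard((qf, q))
                changed = True
    return (init_f, init) in h
```

The loop computes the largest relation satisfying (i) and (ii), so a witness `H` exists exactly when the pair of initial states survives the pruning. The admissible-environment half can be checked with the same routine by swapping the roles of the two transition systems.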


As for stateless interfaces, the properties of an interface are satisfied
after we compose any of its implementations $f$ with any of its admissible
environments $f_{\mathcal{E}}$.


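Proposition 3. For all well-formed interfaces $F$, and all relations $H \subseteq Q_f \times Q$
and $H_{\mathcal{E}} \subseteq Q \times Q_{\mathcal{E}}$ that witness $f \models F$ and $f_{\mathcal{E}} \models F$, respectively, it holds:
(i) $(\mathcal{M}(\hat{q}_f) \cup \mathcal{E}(\hat{q}_{\mathcal{E}}))^* \cap \mathcal{P}(\hat{q}) = \emptyset$; and (ii) for all $q \in Q$ that are reachable from $\hat{q}$,
if $(q_f, q) \in H$ and $(q, q_{\mathcal{E}}) \in H_{\mathcal{E}}$, then $(\mathcal{M}(q_f) \cup \mathcal{E}(q_{\mathcal{E}}))^* \cap \mathcal{P}(q) = \emptyset$.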


Composition of two components is defined as their synchronous product. The 
composition of two interfaces is defined as their synchronous product, as well. 
However, we only keep the states that are defined by the composition of two 
compatible stateless interfaces. 


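Definition 12. Let $F$ and $F'$ be two interfaces. Their composition is defined
as the tuple $F \otimes F' = (X_{F,F'}, Y_{F,F'}, Q_{F,F'}, \hat{q}_{F,F'}, \delta_{F,F'}, \mathcal{A}_{F,F'}, \mathcal{G}_{F,F'}, \mathcal{P}_{F,F'})$, where:
$\hat{q}_{F,F'} = (\hat{q}, \hat{q}')$ and $Q_{F,F'} = \{\hat{q}_{F,F'}\} \cup \{(q, q') \mid F(q) \sim F'(q')\}$;
$(q_2, q_2') \in \delta_{F,F'}(q_1, q_1')$ iff $q_2 \in \delta(q_1)$ and $q_2' \in \delta'(q_1')$;
for all $(q, q') \in Q_{F,F'}$: $(F \otimes F')(q, q') = F(q) \otimes F'(q')$.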
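
The state space of Definition 12 is a synchronous product restricted to compatible pairs, which can be sketched as below. The function name is hypothetical, the stateless compatibility check is passed in as the hypothetical callback `compatible_states`, and restricting successors to states kept in the product is this sketch's reading of the definition.

```python
def product_states(delta, init, delta2, init2, compatible_states):
    """States, initial state and transitions of the composed interface."""
    start = (init, init2)
    states = {start} | {(q, q2) for q in delta for q2 in delta2
                        if compatible_states(q, q2)}
    # Both interfaces move together; successors outside `states` are dropped.
    trans = {(q, q2): {(n, n2) for n in delta[q] for n2 in delta2[q2]
                       if (n, n2) in states}
             for (q, q2) in states}
    return states, start, trans
```

Labeling each product state `(q, q2)` with the stateless composition of the two state interfaces would complete the construction; it is omitted here to keep the sketch short.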


Two stateful interfaces are compatible if the stateless interfaces defined by
their initial states are compatible, i.e. $F(\hat{q}) \sim F'(\hat{q}')$. It follows from the results
proved for the stateless interfaces that compatibility is commutative, composition 
preserves well-formedness and stateful interfaces support incremental design. 


Proposition 4. If $f \models F$ and $g \models G$, then $f \otimes g \models F \otimes G$.


Given an interface, we define transitions parameterized by no-flows on its 
input variables (i.e. with fixed assumptions) or on its output variables (i.e. with 
fixed guarantees and properties). 


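Definition 13. Let $F$ be an interface. Input transitions from a given state
$q \in Q$ are defined as $\delta^X(q) = \{\delta^X(q, A) \mid A \subseteq Z \times X\}$ with
$\delta^X(q, A) = \{q' \in \delta(q) \mid \mathcal{A}(q') = A\}$. Output transitions from a given state $q \in Q$ are defined as
$\delta^Y(q) = \{\delta^Y(q, G, P) \mid G \subseteq Z \times Y \text{ and } P \subseteq Z \times Y\}$ with
$\delta^Y(q, G, P) = \{q' \in \delta(q) \mid \mathcal{G}(q') = G \text{ and } \mathcal{P}(q') = P\}$.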
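
Definition 13 groups the successors of a state by their labels. A minimal Python sketch, with hypothetical function names and with labelings given as dictionaries from states to sets of pairs, is shown below. Unlike the definition, it only returns the non-empty groups, i.e. it skips label values that no successor carries.

```python
def input_transitions(q, delta, assumption):
    """Successors of q grouped by their assumption label."""
    groups = {}
    for succ in delta[q]:
        groups.setdefault(frozenset(assumption[succ]), set()).add(succ)
    return groups


def output_transitions(q, delta, guarantee, prop):
    """Successors of q grouped by their (guarantee, property) labels."""
    groups = {}
    for succ in delta[q]:
        key = (frozenset(guarantee[succ]), frozenset(prop[succ]))
        groups.setdefault(key, set()).add(succ)
    return groups
```

Each value of the returned dictionaries is one set of states in the sense of the definition, and the key records the fixed assumption (or guarantee and property) that selects it.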


Interface $F_R$ refines $F_A$ if all output steps of $F_R$ can be simulated by $F_A$, while
all input steps of $F_A$ can be simulated by $F_R$. This corresponds to alternating
refinement [5].


Definition 14. Interface $F_R$ refines $F_A$, written $F_R \preceq F_A$, iff there exists a
relation $H \subseteq Q_R \times Q_A$ s.t. $(\hat{q}_R, \hat{q}_A) \in H$ and for all $(q_R, q_A) \in H$: (i) $F_R(q_R) \preceq F_A(q_A)$;
(ii) for all sets of states $O \in \delta^Y_R(q_R)$, there exists $O' \in \delta^Y_A(q_A)$ s.t. for all sets of
states $I' \in \delta^X_A(q_A)$, there exists $I \in \delta^X_R(q_R)$ s.t. $(O \cap I) \times (O' \cap I') \subseteq H$.
Fig. 8: Refined interfaces with witness: (a) relation $\{(q_1, q_1'), (q_2, q_2')\}$; and (b)
relation $\{(q_1, q_1'), (q_2, q_2'), (q_3, q_2')\}$.


Example 5. In Figure 8 we depict two examples of refined stateful interfaces. 

In Figure 8(a) the stateless interface in each state only uses output ports 
and it only specifies properties. The initial state of both stateful interfaces is 
the same, so they clearly refine each other. As there are no assumptions and 
guarantees, then, by Definition 14, we need to check that for all successors $q_s$ of the
initial state in the refined interface, there exists a successor $q_s'$ of the initial state
in the abstract interface such that $\mathcal{P}_A(q_s') \subseteq \mathcal{P}_R(q_s)$. This holds for the states
$(q_2, q_2')$. Hence the relation $\{(q_1, q_1'), (q_2, q_2')\}$ witnesses the refinement. Note that
the refined interface is obtained by removing a nondeterministic choice on the 
transition function. 

The witness relation for the refinement depicted in Figure 8(b) is
$\{(q_1, q_1'), (q_2, q_2'), (q_3, q_2')\}$. The initial states are the same, so condition (i) in Definition
14 is trivially satisfied. The refined interface has two distinct output transitions
from the initial state $q_1$. It can either go to state $q_2$ by choosing the set of
guarantees and property with only one element $(x, y)$, or it can transition to
state $q_3$ by committing to the set of no-flows $\{(x, y), (x', y)\}$ for the guarantees
and $\{(x, y)\}$ as property. From the initial state of the abstract interface, there
exists only one input transition possible, to assume that $x$ does not flow to $x'$
and $y'$ does not flow to $x$. The following holds for both states accessible from
the initial state in the refined interface: $\mathcal{A}_R(q_2) \subseteq \mathcal{A}_A(q_2')$ and $\mathcal{A}_R(q_3) \subseteq \mathcal{A}_A(q_2')$.
The refined interface specifies an alternative transition from the initial state 
(represented by state q3) that allows more environments while restricting the 
implementation and preserving the property. 


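Theorem 5. Let $F' \preceq F$. (a) If $f \models F'$, then $f \models F$. (b) If $f_{\mathcal{E}} \models F$, then $f_{\mathcal{E}} \models F'$.


Theorem 6 (Independent implementability). For all well-formed interfaces
$F_1'$, $F_1$ and $F_2$, if $F_1' \preceq F_1$ and $F_1 \sim F_2$, then $F_1' \sim F_2$ and $F_1' \otimes F_2 \preceq F_1 \otimes F_2$.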


The composition operation on stateful information-flow interfaces can be 
generalized to distinguish between compatible and incompatible transitions of 
interfaces when they are composed. Usually this is done by labeling transitions 
with letters from an alphabet, so that only transitions with the same letter can 
be synchronized. While necessary for practical modeling, we omit this technical 
generalization to allow the reader to focus on the novelty of our formalism, which 
is the ability to specify information-flow constraints (environment assumptions,
implementation guarantees, and global properties) at each state of an interface. 


5 Related Work 


To the best of our knowledge, we are the first to provide a theory for top-down 
and bottom-up design of information-flow system requirements that supports 
both incremental design and independent implementability of systems. The
literature on information flow closest to our work focuses on its semantic aspects.
The novelty of our work lies in the explicit separation of the structural concerns
from the semantic aspects of information-flow.

Language-based techniques have proven useful for verifying and enforcing
information flow policies [29]. Examples range from type systems [15] to program
analysis using program-dependency graphs (PDGs) [18,16]. In our approach we 
aim at composition and refinement notions that are independent of the language 
adopted for the implementations. 

Information-flow properties can be specified with respect to the observed be- 
havior of a system, in which each of its execution runs is abstracted as a trace. In 
this approach, properties often compare multiple executions of a system to certify 
that no forbidden flow can be deduced by an observer. Such properties over mul- 
tiple execution traces are called hyperproperties [12]. Temporal logics [26], like 
LTL or CTL* are used to specify trace properties of reactive systems. HyperLTL 
and HyperCTL* [11] extend temporal logics by introducing quantifiers over path 
variables. They allow relating multiple executions and expressing information- 
flow security properties [12,11]. Epistemic temporal logics (ETL) [9] provide the 
knowledge connective with an implicit quantification over traces. With ETL we 
can reason about the knowledge gain of agents over time. Then, we can spec- 
ify which information can be learned by the agents while interacting with the 
system [6]. All these LTL extensions reason about closed systems while our ap- 
proach allows compositional reasoning about open systems. Moreover, we focus 
here on the structural aspect of information-flow, and not yet on its semantic 
interpretation. Thus, all information-flow trace-based semantics are orthogonal 
to our approach. 

Interface theories belong to the broader area of contract-based design [8], orig- 
inally popularized by Meyer [24], following earlier ideas introduced by Floyd and 
Hoare [14,19]. Our theory follows closely the philosophy for formal frameworks 
for systems design introduced for Interface automata (IA) [1] and Assume/Guar- 
antee (A/G) [2] interfaces. Interface theories were later extended with extra- 
functional requirements such as resource [10], timing [4,13] and security [21] 
requirements. Unlike in previous interface formalisms, we had to introduce the 
notion of properties which capture the intent of the designer and can be used to 
steer the refinement of interfaces. 

Interface for structure and security (ISS) [21] is a variant of IA that enables 
specification of two types of actions on (1) low and (2) high confidential in- 
formation. ISS uses a bisimulation-based notion of non-interference that checks 
whether the system behaves in the same way when high actions are performed
or when they are considered hidden actions. Our approach is orthogonal to IA 
and their extensions: we do not characterise the type of actions of each compo- 
nent, but only their input/output ports, defining explicitly the information-flow 
relations between variables. 

Our approach took inspiration from relational interfaces (RIs) [31]. RIs spec- 
ify the legal inputs that the environment is allowed to provide to the component 
along with the legal outputs that the component can generate when provided 
with these inputs. RIs do not have assumptions and guarantees defined separately.
Instead, they have a contract that specifies the desired input-output behavior. 
A contract in RIs is expressed over individual traces. Then, an RI contract can 
only relate input and output values in a trace, and not across multiple traces. 
This considerably restricts the expressivity of RIs concerning information-flow
properties. Besides, RIs are trace-based interfaces, while in our approach we focus on
the structural aspect of information-flow, which may change from state to state 
(in the stateful case). Our approach can be seen as a limited way to introduce 
relational properties into A/G interfaces, namely solely for guiding refinement. 
This limited way avoids many of the technical complexities of general relational 
interfaces [31]. 


6 Conclusion 


We propose a novel interface theory to specify information-flow properties. Our 
framework includes both stateless and stateful interfaces and supports both in- 
cremental design and independent implementability. To achieve this, unlike in 
previous interface formalisms, we introduce the notion of properties which cap- 
tures the intent of the designer for the interaction between assumptions and 
guarantees. Moreover, properties can be used to steer the refinement of inter- 
faces. It will be interesting to study the introduction of such design-guiding 
properties in the context of other interface languages. 

As future work, we will explore how to extend our theory with sets of must-flows,
i.e. support for modal specifications [27]. This will make it possible, for example, to
specify flows that a state q must implement so that the system can transition
to a different state, which is useful to specify declassification of information. 
Another direction is to explore trace semantics for our interfaces. 
Abstract. Traceability is the capability to represent, understand and analyze the
relationships between software artefacts. Traceability is at the core of many soft- 
ware engineering activities. This is a blessing in disguise as traceability research 
is scattered among various research subfields, which impairs a global view and 
integration of the different innovations around the recording, identification, eval- 
uation and management of traces. This also limits the adoption of traceability 
solutions in industry. 

In this sense, the goal of this paper is to present a characterization of the trace- 
ability mechanism as a feature model depicting the shared and variable elements 
in any traceability proposal. The features in the model are derived from a sur- 
vey of papers related to traceability published in the literature. We believe this 
feature model is useful to assess and compare different proposals and provide a 
common terminology and background. Beyond the feature model, the survey we 
conducted also helps us to identify a number of challenges to be solved in order
to move traceability forward, especially in a context where, due to the increasing 
importance of AI techniques in Software Engineering, traces are more important 
than ever in order to be able to reproduce and explain AI decisions. 


1 Introduction 


The need for traceability has always been salient in software and systems development. 
Across the years, there has been a continuous interest in developing techniques to fa- 
cilitate the representation and analysis of traces and links between related artefacts. It 
helps explain their execution and evolution as required in many software engineering
activities and disciplines such as code-generation, program understanding, software
maintenance, and debugging. 

The importance of traceability was first recognized in system engineering, espe- 
cially related to the development and certification of critical systems where it is a pri- 
mary concern. As an example, traceability is part of any certification mechanism in all 
commercial software-based aerospace systems as stated in documents like the RTCA 
DO-178C (2012) [76,62]. The consideration of various levels of abstraction in software 
development and the meaning of verification in the model-based development paradigm
(which treats abstract representations, i.e. models, as the core artefact for
conceptualization) was later introduced with companion documents (specifically, DO-331). The
automotive industry has followed the same path with the construction of an international
standard for functional safety, the ISO-26262 [46]. 

Despite this strong evidence of the need for explicit (and automated) tracing
abilities in software development, traceability is not widely adopted, and it is even
less often automated. There is little feedback from its concrete use in industry beyond the
critical domains above [75], and where it is used, it ends up being mostly a manual process [55].
Moreover, with no standard definition or representation of traces, it is difficult to bridge 
the gaps between the different partial traceability solutions existing in research sub- 
fields [4,102,101]. Even the software engineering body of knowledge does not seem to 
properly consider the power of traceability as it only mentions traceability once [16]. 

The foundations for an effective modelling of traceability are scattered across an
extensive literature. Approaches vary greatly in their means and goals. Moreover, most
focus on specific pairs of artefacts and therefore remain difficult to integrate in different
industrial scenarios. Note also that this happens in a context where artificial intelligence
techniques are being integrated in development processes, raising stronger
reproducibility and explainability concerns, both of which require the assistance of
traceability mechanisms.

This paper aims to provide a comprehensive perspective on the state of the art of
traceability techniques in software development and their limitations. Our short-term
goal is to facilitate the evaluation and comparison of current solutions. Our
mid-term goal is to accelerate the development of new traceability solutions that could
benefit from the existing ones, thanks to our new conceptualization in the form of a
feature model describing the potential dimensions and concerns a traceability solution
may wish to consider. We do not create the feature model only based on our (partial)
knowledge and expertise in the domain. Instead, we ground our classification with a 
survey of the published literature in this field. According to this survey, we group the 
traceability features in three main dimensions: trace definition, trace identification and 
trace management, with the corresponding feature hierarchies for each of them. 

The paper is organized as follows. Section 2 gives an overview of the scientific
work related to traceability. We then recall some basic
terminology in Section 3. Section 4 describes how we conducted our literature review
and Section 5 presents a detailed feature model derived from the survey of the retrieved 
works. This analysis also helps us to propose a number of discussion points and open 
challenges in Section 6 before concluding this work. 


2 State of the art of software traceability 


Traceability was proposed, from the very beginning of software engineering, to ensure 
that a system being developed actually reflects its design. Already in the original NATO 
working conference, quality projects were praised for making "the system that they are 
designing contain explicit traces of the design process" [81]. From that point on, trace- 
ability has been studied from a myriad of perspectives, dimensions and applications. 
Historically, traceability started in requirement engineering. The very
idea to follow the impact of changes in the requirements to other artefacts (and
backward) was then and remains today the most prominent goal [35]. Precise and rich
requirements allow a proper follow up of their later implementations [21]. Over time,
the advantages of using traces, i.e., the record of (inter-)dependencies between
artefacts, have proven applicable to most if not all spheres of software maintenance.
The use of traces spans software certification and testing, feature location,
debugging, code generation, and so on. With the proliferation of traceability purposes,
some authors explicitly asked for better sharing of experiences in using traceability 
[36] and evaluating the solutions existing so far [91]. Surveys and literature reviews 
trying to group and compare them began to appear as well, though most of them fo- 
cused on specific subareas such as requirement engineering [35,15], model-driven de- 
velopment [32,101,70,86,63], software product lines [96,3], benchmarking [91], and 
information retrieval [23,13,39]. To complement these scientific surveys, Konigs et al. 
survey industrial application of traceability approaches, showing its limited penetra- 
tion [52]. Neumuller et al. show that the adoption is worse in small businesses where 
traceability is even less automated [67]. Finally, Charalampidou et al. add to the conclu- 
sion of other surveys that "although many studies include some empirical validation", 
there is still much to be done with respect to validation and reproducibility [20]. 


This is aggravated by the fact that, as pointed out above, many of the proposals 
belong to different research subfields, which limits the discovery and awareness of
alternative solutions. For instance, authors point out that researchers in requirement
engineering and in model-based development do not communicate enough with each
other [101,70,85]. This lack of communication and shared understanding is one of the
open challenges in the traceability domain [22,4,28]. To solve this issue, several works
aim at proposing specific traceability models. Unfortunately, many investigations suffer
from a lack of generalizability due to the specific nature of the problem being solved (e.g.,
certification conformity [51], model transformation coevolution [38]), or the specific nature
of the solution considered (e.g., w.r.t. its language: SysML [65], w.r.t. its engineering 
field: SPL [3], agile [60]). 


As an example, the automatic identification of trace links is one of the most stud- 
ied features. There are plenty of proposals but as they are evaluated using different 
datasets and configurations, they cannot be directly compared [89,39,13]. Another
example would be model-driven engineering, where the use of traceability-specific
languages together with automated model transformation appears as fertile ground for
end-to-end traceability. This led authors to present classifications and terminologies 
for a systematic perspective on the tracing of MDE development [70,28,85]. Never- 
theless, proposals tend to focus on a specific model-driven engineering problem: the 
co-evolution of models and transformations [2] instead of aiming for more general so- 
lutions. Mustafa et al. argue that "the main issues in traceability nowadays are building 
traceability models that can accommodate the capturing of traceability information and 
providing common semantics for trace links" [63]. As a result of this confusing
situation, authors asked for more standardized practices. Two proposals gather
fundamental and model-based terminology [36,45]. We take our general knowledge
about traceability from them and add to their definitions an actionable categorization 
for existing and coming traceability approaches. 


We agree with these authors that this lack of a de jure/de facto standard is hampering
the benefits of current solutions and hindering evolution in the field. This paper intends
to fill this gap by proposing a traceability characterization that stems from the
analysis of existing proposals. We believe this model can be useful to researchers trying to
improve traceability techniques in any subfield and to practitioners looking for a way to 
compare and choose the traceability solution that best suits their needs. 


3 Towards a common traceability terminology


A clear conclusion from the previous section is the lack of a commonly agreed-upon
conceptualization of traceability that helps to evaluate, compare, and reuse traceability
solutions over a variety of scenarios and application domains. Thus, the incoherency
problem still arises in traceability research [100]. Even if an individual article makes a 
claim that withstood rigorous testing and statistical analysis, it might not use the same 
words as an adjacent article, or it would use the same words but intend different mean- 
ings. For instance, the term traceability is used to designate both the ability to trace 
system elements, and the traceability links (the relations) themselves [15,4]. 

Therefore, before proposing our global traceability feature model to classify trace- 
ability solutions, we first recap the different usages of the key traceability concepts and 
propose a unified definition that we will use in the rest of the paper. 


3.1 Traceability components 


Traceability research refers mainly to a definition from Gotel et al. that defines trace- 
ability as the ability to describe and follow the life-cycle of a requirement, from its 
initial specification to the design and code elements of the system implementing it [35]. 
This is still the most popular meaning for traceability [15,7] even if modeling ap- 
proaches try to generalize this notion by seeing traceability as a valuable tool to link 
all types of artefacts at either the same or different levels of abstraction [56,95].

Regardless of the specific interpretation of traceability, we observe a division of 
knowledge into four main areas: 


— Strategizing traceability. It involves defining the explicit traceability purpose for 
the project at hand and how to best reach that goal. Maro et al. address the impor- 
tance of a coherent strategy. The authors propose an introductory methodology to 
"provide support for establishing a traceability strategy that allows the organization 
to achieve its goals and measure the impact of [its] traceability strategy" [60]. 

— Trace and artefact representation. It covers the design / adaptation of a language 
to be used to define the traces and decisions regarding its syntax, expressiveness, 
variability, integration, etc. For instance, this can be done by means of creating a 
full traceability domain-specific language. 

— Trace link identification. It designates the identification of traces in a software 
system, be it a post-requirement assisted elicitation, a live record during a system 
execution or an automatic AI-based inference process. The latter approach is
currently the dominant trend for identifying links between heterogeneous artefacts.

— Trace management. It refers to the ways to use and maintain the traces. This
includes tool support for the persistence, retrieval, and analysis of traces.

The first area is a high-level concern that influences the requirements of the other
three to cover the specific needs of a project. These three will therefore be used to 
structure our feature model later on. Note that the representation component should be 
part of any traceability solution as it is the base component to be able to, at the very 
least, express traceability information. 


3.2 Traceability glossary 


We propose some general definitions for the most frequently encountered traceability 
terms while searching for and studying solutions for traceability in any of the above 
categories. These definitions, mostly borrowed from past literature [36,45], aim to en- 
compass the different uses and dimensions of traceability depicted above. Our set of 
terms is not exhaustive but provides a common core generic enough to be then adapted
to specific scenarios. This is also why we try to be precise with the definitions, while 
also offering room for slightly different (but compatible) interpretations. 


— Traceability is the ability to trace different artefacts of a system (of systems). Gotel 
et al. define traceability as "requirements traceability [which] refers to the ability to 
describe and follow the life of a requirement, in both a forwards and backwards
direction" [35]. Gotel’s definition has been extended to MDE software traceability as
"any relationship that exists between artifacts involved in the software engineering 
life cycle" [1]. 

— A trace is a path from one artefact to another. A trace is composed of atomic trace 
links that directly relate artefacts to each other. The representation of traces, their
data structure and behaviour, is defined in a traceability grammar or metamodel [25] 
depending on how the trace language is defined. In any case, the language definition 
specifies the concepts and relationships available to define traces. As discussed 
before, no standard language has emerged yet. 

— An artefact can be any element of a system - e.g., unstructured documentation, 
source code, design diagrams, test cases and suites... The nature of artefacts follows 
two main dimensions: the life cycle phase they belong to (e.g., specification, design, 
implementation, test), and their type (e.g., unstructured natural language, grammar- 
based code, model-based artefact). The granularity of artefacts is the level to 
which artefacts can be decomposed into subparts. We call a fragment the product
resulting from the decomposition of an artefact. A fragment can itself be broken down
into smaller parts (or sub-fragments), and so on. 

— A trace link is a direct relationship between two artefacts. Links can be typed to
better support the heterogeneous nature of traceability applications. The type of the
link can help express the rationale behind the relationship - it informs not only how 
artefacts are linked but also why [55]. Typing is a primary concern in conceptual 
modeling in general [68]. This definition of a link is consistent with the concept of 
link in popular modeling languages like UML or SysML. 
Links can be explicit or implicit. An implicit link reflects a bond between artefacts at a
syntactic or semantic level without the need for an explicit link to be part of the
model (e.g., a binary class and its respective source code artefact are implicitly
"linked" to each other, yet this bond is not part of any language or grammar
definition) [70].

— An agent is the (human) actor accountable for an artefact or a link.

— Trace integrity is the degree of reliability that a trace bears. It is an indirect
measure that includes, for example, the age of a trace, the volatility of artefacts
targeted by the trace, and the automation level of tracing features.
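
As an illustration only, the glossary terms can be sketched as a small Python data model. The class and field names (Artefact, TraceLink, Trace, link_type, agent) are assumptions made for this example and are not taken from any of the surveyed approaches.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Artefact:
    """Any element of a system (documentation, code, diagram, test...)."""
    name: str
    phase: str  # life cycle phase, e.g. "specification" or "implementation"
    kind: str   # artefact type, e.g. "natural language", "code", "model"


@dataclass(frozen=True)
class TraceLink:
    """A direct, optionally typed, relationship between two artefacts."""
    source: Artefact
    target: Artefact
    link_type: str = "untyped"  # the rationale of the link, e.g. "implements"
    agent: str = "unknown"      # (human) actor accountable for the link


@dataclass
class Trace:
    """A path from one artefact to another, composed of atomic trace links."""
    links: list = field(default_factory=list)

    def is_path(self) -> bool:
        # Each link must start at the artefact where the previous one ended.
        return all(a.target == b.source for a, b in zip(self.links, self.links[1:]))

    def endpoints(self):
        return self.links[0].source, self.links[-1].target


# Hypothetical artefacts from three life cycle phases.
req = Artefact("REQ-1", "specification", "natural language")
cls = Artefact("Account.java", "implementation", "code")
tst = Artefact("AccountTest.java", "test", "code")

trace = Trace([TraceLink(req, cls, "implements"), TraceLink(cls, tst, "verified by")])
broken = Trace([TraceLink(req, cls), TraceLink(req, tst)])
```

Here `trace` relates the requirement to its test through two typed links, while `broken` is not a path because its second link does not start where the first one ends.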


On top of these concepts, a recent work, by Holtmann et al., makes a distinction 
between a foundational and a specifically model-based terminology [45]. The latter
adds a specification for model and language scope definitions, as well as a distinction
between relational and referential trace links. 


— Intra/Inter model trace links differentiate between relations that link elements
of the same instance of the language and relations linking elements from distinct
instances. This distinction was first introduced by Lindval et al. [54]. 

— Intra/Inter DSL differentiates between relations that link elements in models based
on the same language and relations that link elements in models from different
languages.

— The distinction between Relational and Referential trace links lies in the instan- 
tiation (or not) of the instance link. "A relational trace link is represented by a 
dedicated node with incident directed edges pointing to the trace artifact nodes" 
whereas "a referential trace link is a directed edge from one trace artifact node to 
another trace artifact node". In the latter case, a trace link is commonly represented 
as a property of the source artefact. 
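
The difference between the two styles can be made concrete with a minimal Python sketch. The names ArtefactNode, RelationalLink, traces_to and the two helper functions are hypothetical and only serve the illustration; the quoted definitions above remain the reference.

```python
from dataclasses import dataclass, field


@dataclass
class ArtefactNode:
    name: str
    # Referential style: outgoing links are a property of the source artefact.
    traces_to: list = field(default_factory=list)


@dataclass
class RelationalLink:
    # Relational style: the link is a dedicated node pointing to artefacts.
    sources: list
    targets: list
    link_type: str = "untyped"


def referential_pairs(nodes):
    """Collect (source, target) name pairs stored on the artefacts themselves."""
    return {(node.name, target.name) for node in nodes for target in node.traces_to}


def relational_pairs(links):
    """Collect (source, target) name pairs stored on dedicated link nodes."""
    return {(s.name, t.name) for link in links for s in link.sources for t in link.targets}


requirement = ArtefactNode("REQ-1")
code = ArtefactNode("Account.java")

# Referential: the edge is stored on the source artefact itself.
requirement.traces_to.append(code)

# Relational: a separate link node carries the edge and its type.
link = RelationalLink(
    sources=[ArtefactNode("REQ-1")],
    targets=[ArtefactNode("Account.java")],
    link_type="implements",
)
```

Both encodings express the same relation here, but only the relational one has a place to attach link-level information such as a type, or several sources and targets.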


Some of these concepts will explicitly appear in our traceability feature model, while
others act as requirements and usages that the features in the model should support or
facilitate. They should be taken into account when choosing a traceability solution,
depending on how well that solution covers the features of interest for the
project at hand.


4 Traceability Survey method 


In this section we describe the methodology we followed to collect papers proposing
traceability solutions, including at the very least the core representation component 
(see previous section). The analysis of these papers will give rise to the feature model 
we will present next. 

The selection process combined the manual selection of a few approaches based 
on our own experience working in this field and/or covered by other meta-studies 
[36,4,22,39] together with a systematic literature search mining bibliographic data sources 
following the literature review process established by Kitchenham and Charters [49]. 
Fig. 1 depicts the three main steps of the process.


4.1 Data source and search strategy


We used DBLP [10] as our core electronic database to search for primary studies on 
traceability. To avoid missing possibly relevant approaches, we decided not to put a 
specific period constraint for the search, but we limited the scope of the search to papers 
of five pages or more to avoid opinion and vision papers, posters, tool demos and other 
types of short papers to reduce the number of results while maximizing their quality. 

Based on the topic of this survey, we defined the terms of the search query according
to the recommendations of Kitchenham and Charters [49]. We applied the query to
the title and abstract of potentially relevant publications. As very generic terms like
“trace” or “traceability” returned thousands of results, we combined trace-related
keywords with language-related ones in the search query. We target traceability
proposals that, at the very least, discuss how traces need to be represented or
expressed, rather than only discussing their application to some specific domain without
going deep into the details. As many traceability languages are model-based, we included
model, modeling, and other core MDE concepts as part of the language variations. This 
resulted in a total of 203 papers. 

Here is the exact query we applied: 
.*(([Tt]rac(eability|ing))|([Tt]race[rs])).* AND
.*(([Mm]odel[- ])(([Dd]riven)|([Bb]ased))|
MD[DAE]|Model[l]ing|[Tt]ransformation|DSL|[Ll]anguage).*
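
To illustrate how such a conjunctive query behaves, the following Python sketch applies two regular expressions to a text such as a title or abstract. It assumes the query reads as the two patterns below, joined by AND. The function name matches_query and the sample titles are hypothetical, and the sketch does not reproduce the DBLP search interface.

```python
import re

# First half of the query: trace-related keywords.
TRACE_TERMS = re.compile(r".*(([Tt]rac(eability|ing))|([Tt]race[rs])).*")

# Second half of the query: language-related keywords.
LANGUAGE_TERMS = re.compile(
    r".*(([Mm]odel[- ])(([Dd]riven)|([Bb]ased))|"
    r"MD[DAE]|Model[l]ing|[Tt]ransformation|DSL|[Ll]anguage).*"
)


def matches_query(text: str) -> bool:
    """Return True when the text satisfies both halves of the AND query."""
    return bool(TRACE_TERMS.search(text)) and bool(LANGUAGE_TERMS.search(text))


# Hypothetical titles used only to exercise the patterns.
titles = [
    "Traceability in model-driven engineering",
    "Tracing requirements with a DSL",
    "Traceability of food products",
    "A modeling language for robots",
]
selected = [title for title in titles if matches_query(title)]
```

Only the first two titles are selected: the third lacks a language-related keyword and the fourth lacks a trace-related one.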


4.2 Pruning 


In what follows, we describe our inclusion and exclusion criteria. We further explain 
how we applied these criteria on the previous set of papers. 


Inclusion criteria                                     | Exclusion criteria
1. the paper is a technical contribution               | 1. the paper is not a primary study
2. the paper is about tracing in software engineering  |
3. traceability is the main concern of the paper       |


Before we applied these criteria to the papers fetched by our query, we automatically
removed papers shorter than five pages. We also automatically excluded
papers whose titles mentioned "biology", "education", "kinetics", "logistics",
"physiology", "physics", "neuroscience", "agriculture", and "food", each of which
appeared in a couple of results. We manually examined the 183 papers left and excluded
40 papers that did not fulfill the criteria or were duplicates.
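
The automatic part of this pruning step can be sketched as a simple filter. The record layout (a dictionary with "title" and "pages" keys), the function name, and the sample entries are assumptions for illustration; the manual examination against the inclusion and exclusion criteria is not modelled.

```python
# Title keywords that led to automatic exclusion, as listed in the text.
EXCLUDED_TITLE_WORDS = (
    "biology", "education", "kinetics", "logistics", "physiology",
    "physics", "neuroscience", "agriculture", "food",
)
MIN_PAGES = 5  # papers shorter than five pages are removed


def passes_automatic_filters(paper: dict) -> bool:
    """Keep a paper only if it is long enough and its title is on topic."""
    if paper["pages"] < MIN_PAGES:
        return False
    title = paper["title"].lower()
    return not any(word in title for word in EXCLUDED_TITLE_WORDS)


# Hypothetical records, not actual survey data.
papers = [
    {"title": "A traceability metamodel for model transformations", "pages": 12},
    {"title": "Tracing languages: a poster", "pages": 2},
    {"title": "Modeling food traceability in supply chains", "pages": 10},
]
kept = [paper["title"] for paper in papers if passes_automatic_filters(paper)]
```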


4.3 Snowballing 


At the end of the previous steps, we double-checked that we did not miss any potentially
relevant approach, for instance because some workshop papers are only
indexed by ACM or because papers may use different synonyms for traceability like
“composition” or “extension”.

Finally, we added papers we were aware of based on direct knowledge or from
other surveys we had read (if not already in the result set) and a few more we found
by snowballing on the references of the selected papers. They amount to a total of 10 more
papers. This led to a final result of 159 papers. Among them, there are 41 journal
articles, 82 in conference proceedings, and 36 workshop reports (see Table 1). Fig. 2
shows the chronological distribution of the selected publications.


Fig. 2: Papers selected related to traceability and modeling.


Publication type | Number of papers
Journal          | 41
Conference       | 82
Workshop         | 36
Table 1: Publication types of the selected papers.


4.4 Threats to validity in the selection process 


We acknowledge limitations in the execution of our survey method. First, we only used 
DBLP as a source database. Yet, it is recognized as a representative electronic database 
for scientific publications on software engineering and already contains more than five 
million publications from over two million authors. Setting the limit based on the
number of pages alone to exclude short papers is another threat to validity. Yet, it is a
reproducible practice that limits the number of papers to analyse and thus helps concentrate
on the topic rather than the engineering of the survey. Then, the vocabulary related to 
traceability is scattered among various fields of application with their respective nu- 
ances. We mitigate the risk of missing papers by manually adding papers that were not 
using variations of this term but were still referenced by papers that did. Still, focusing
on traceability as a key term was also a conscious decision as we wanted to characterize
the works in this field, focusing on those papers that define themselves as part of it. 


5 A feature model to characterize software traceability 


This section presents our feature model describing the traceability features and
dimensions found in the analysis of the literature. Our feature model groups them by
similarity and provides additional descriptions of the most important aspects of each one, e.g.,
the alternative implementations of the same feature and/or the most and the
least studied features in each group. The next subsections provide some background on feature
modeling and then zoom in on each of the three main dimensions of traceability: trace
representation, trace identification, and trace management.


5.1 Introduction to feature modelling 


A feature model leverages features as the abstraction mechanism to reason about prod- 
uct variability. It is a hierarchically arranged set of features, where relationships be- 
tween a parent feature and its child features may be categorized as: and — all sub- 
features must be selected, alternative — only one subfeature can be selected, inclusive 
or — one or more can be selected, mandatory, and optional [48]. Each feature represents 
an increment in product functionality. 
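
The relationship kinds listed above can be illustrated with a small configuration checker. This is a minimal sketch under simplifying assumptions: the Feature class, the group and optional attributes, and the toy "Editor" model are hypothetical, cross-tree constraints are ignored, and selected features are only checked along branches whose parent is itself selected.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    name: str
    group: str = "and"      # how the children are grouped: "and", "xor" or "or"
    optional: bool = False  # only meaningful for children of an "and" group
    children: list = field(default_factory=list)


def is_valid(feature: Feature, selected: set) -> bool:
    """Check the selection below `feature`, which is assumed to be selected."""
    chosen = [child for child in feature.children if child.name in selected]
    if feature.group == "and":
        # All mandatory subfeatures must be selected.
        if any(not c.optional and c.name not in selected for c in feature.children):
            return False
    elif feature.group == "xor":
        # Alternative: exactly one subfeature can be selected.
        if len(chosen) != 1:
            return False
    elif feature.group == "or":
        # Inclusive or: one or more subfeatures can be selected.
        if not chosen:
            return False
    return all(is_valid(child, selected) for child in chosen)


# Toy model: an editor with exactly one theme and an optional spellchecker.
editor = Feature("Editor", group="and", children=[
    Feature("Theme", group="xor", children=[Feature("Light"), Feature("Dark")]),
    Feature("Spellcheck", optional=True),
])
```

For instance, selecting `{"Theme", "Light"}` is a complete configuration, whereas selecting both themes violates the alternative group and omitting "Theme" violates the mandatory constraint.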

Feature modeling is a technique that has been intensively used for documenting the
points of variability in a software product line, how the points of variability constrain
one another, and what constitutes a complete configuration of the system. But beyond
product lines, feature models are also increasingly used to shed light on complex
domains by representing the core concerns and variation points in complex ecosystems
(e.g., [17]), as we do in this paper.


5.2 Trace definition and representation 


All approaches must discuss their representation of trace artefacts even if they can differ 
on the type of traces they consider and the application they target. Representations are 
so diverse that our survey selected more than 80 papers mentioning their own distinct 
definition for traceability — with 20 metamodels effectively depicted in those papers. 
Some researchers present generic graph-based representations [87,37] while others 
focus on representations much more specific to a concrete application like a metamodel 
for change impact analysis [34] or multi-model consistency [94]. In both cases, approaches
differ in what they target and in how they represent a trace.

Fig. 3 shows the hierarchy of features related to the definition and the representation 
of trace artefacts. A particular focus is put on the typing of traces’ relationships. Typing
relationships is important to add semantics to the trace so that the engineer can know not 
only what the linked artefacts are but also why they are linked. As such, it facilitates the 
application of traceability solutions to specific domains. We also detail the genericity 
of the language, the nature of the artefacts covered by the traceability proposal, and the 
possibility to annotate traces with quality properties.


Fig. 3: Features related to the representation of a trace.


We would like to highlight the contribution of model-based approaches to
traceability in this section. The use of MDE tooling such as ATL [84,47], or the Eclipse
Modeling Framework (EMF) allows the automated generation of traceability information
as a side effect of executing operations [32,101]. The modeling community has
proposed metamodels for end-to-end traceability [43,41], as well as metamodels specific
to engineering domains such as model transformation [47,3,97,11] or software
product lines [47,97]. Paige et al. call for more flexible modeling where models of different
formats are associated with each other through annotations that allow automated bond or
dependency inference between both application and engineering domains [89,72].


Language Languages specific to traceability provide the ability to represent trace
artefacts with increased relevance and accuracy. Yet, they are often built ad hoc
and are hard to reuse in other domains. Among these
domain-specific languages for traceability, some authors attempt a generic definition of 
traceability [43,6] while others provide a language specific to a single domain, e.g., 
traceability for software product lines [3].

We found few studies interested in the use of general-purpose software languages
for traceability - even though this would be appealing to industrial partners interested 
in instrumenting their legacy systems code with traceability information to facilitate 
future evolution or migrations [65]. Representing traces in spreadsheets, text files, or 
databases, shows better learning curves than using a domain specific language, but at 
the cost of a cognitive gap between software engineers and domain experts. As an un- 
fortunate consequence, "the maintenance costs turns out to grow accordingly [to the 
usability of generic representations] and team members fail to keep the trace artefacts 
up-to-date" [21]. 

A potential sweet spot lies in the making of orthogonal approaches that “plug” trace- 
ability concerns on top of other languages to benefit from an existing language structure 
while keeping most of the benefits of using a DSL. 


Artefacts targeted We distinguish between the nature of the artefacts targeted by trace- 
ability purposes and their granularity as both dimensions are important. For the nature 
aspect, on the one hand, investigations differ on the development phase they target. 
Linking requirement specifications to design and code level predominates in the
literature with more than 50% of the papers in the survey addressing requirement traceability.
Other phases such as test and verification are targeted as well but to a lesser extent
(10 approaches). On the other hand, the type of the artefacts is important to deduce the
level of potential generalization to other phases of the software lifecycle. Papers focus
on four different types: unstructured documents, structured grammar-based artefacts,
structured model-based artefacts, and binaries.

With regard to the granularity of the artefacts targeted, i.e., their level of decompo- 
sition, few approaches go for a customizable granularity to adapt to artefact hierarchies 
[43,60] while most of the others focus on specific types of artefacts (e.g., to concentrate 
their work on specific optimizations of trace identification). 


Relationship types As many authors have demonstrated, offering to the user the abil- 
ity to define personalized types of relations between the artefacts of a system fosters 
the comprehensibility of the traces produced [68]. We distinguish between approaches 
offering predefined types and approaches allowing custom typing. Often the predefined 
types relate to the field of software engineering (implements, inherits, uses, executes 
...), but not only. For example, Maletic et al. mention that a separation between causal, 
non causal, and navigation relationships can be appropriate [57]. Predefined types
allow increased monitoring and user-friendliness for developers. They are found in most
contributions related to the optimization of trace identification. On the other hand,
allowing users to define the types of relationships specific to their area of expertise helps to
fill the gap between the design and the use of tracing functionalities [102]. 

Obviously, a fixed typing facilitates the analysis of the traces as the potential set of
semantics and interpretations is fixed, while offering domain-specific types increases
the usability and comprehensibility of the approach. As an example, SysML v2
offers a more powerful mechanism to define links between artefacts compared to the
previous SysML version (where we had a sole dependency-like mechanism).

The literature also shows a distinction between approaches considering
relationships with multiple sources and targets and relationships allowing only a single source.


Trace quality In most of the papers, quality aspects are barely mentioned. It seems
quality of the generated traces is not a major focus, or at least storing and annotating 
the traces with such information is not. Yet, a few studies mention coverage and in- 
tegrity. The coverage of a set of execution traces is used in approaches for software 
testing [33]. Coverage is also used by Rath et al. who address the problem of missing 
links between commits and issues with a classifier they train on textual commit infor- 
mation to identify missing links between issues and commits (i.e., a lack in the cov- 
erage indicates such missing links) [82]. Matrix-based visualizations are particularly 
fit to assist coverage-related tasks (see Section 5.4). Integrity of traces is addressed in
work on model transformation where co-evolution involves an automatic verification of
their coherence with other (versatile) software artefacts [94,92]. In the same manner,
Heisig et al. tag links whose end artefacts have been modified or deleted to inform
the user of such changes [43]. The co-evolution of traces implies measuring distances
between artefacts (syntactic, cognitive, geographic, cultural...) [9]. It also refers to the 
analysis of the changes of the system that impact traceability artefacts [34,98]. In our 
survey, nine papers address artefact co-evolution and 17 tackle model transformation
limitations. Model transformations are a valuable tool to automate co-evolution tasks. In the many
studies focusing on the optimization of link identification, the quality of the results is
mainly evaluated with precision and recall measurements and never relies on inherent
characteristics of trace artefacts. Few researchers include user feedback [13].
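
For reference, precision and recall over a set of recovered trace links can be computed as in the following sketch. The function name and the sample links are hypothetical; the expected set stands for a manually validated set of links.

```python
def precision_recall(recovered: set, expected: set):
    """Precision: share of recovered links that are correct.
    Recall: share of expected links that were recovered."""
    true_positives = len(recovered & expected)
    precision = true_positives / len(recovered) if recovered else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical validated links between requirements and source files.
expected = {
    ("REQ-1", "a.py"), ("REQ-2", "b.py"), ("REQ-3", "c.py"),
    ("REQ-4", "d.py"), ("REQ-5", "e.py"), ("REQ-6", "f.py"),
}
# Hypothetical output of a link recovery technique: three correct, one wrong.
recovered = {
    ("REQ-1", "a.py"), ("REQ-2", "b.py"), ("REQ-3", "c.py"), ("REQ-3", "z.py"),
}
precision, recall = precision_recall(recovered, expected)
```

With these sets, three of the four recovered links are correct (precision 0.75) and three of the six expected links are found (recall 0.5). Neither value says anything about the age, volatility, or provenance of the links themselves.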


5.3 Trace identification 


[Figure 4: feature diagram rooted at "Trace identification". Its top-level features include
Manual elicitation, Live record, Identification rules, Domain contextualisation, and Tool
assessment. The legend distinguishes optional, mandatory, alternative (xor), and
alternative (or) features.]

Fig. 4: Features related to the identification of trace links


Fig. 4 shows the hierarchy of features related to the identification of traces with four
main possible categories: the manual elicitation of traces, their live record during
execution and evolution, rule-based alternatives to assist the user with automation potential,
and AI-augmented identification with domain contextualization.


Manual elicitation Manual elicitation makes it possible to create traces in an ad hoc
manner. As an example, one of our industrial partners chose to hire a developer to elicit
trace links necessary for a certification commitment. This was chosen rather than a 
(semi-)automated approach, as they were not convinced the effort of augmenting an 
existing tool would pay off for that specific project. 


Recording instrumentation Teams can instrument the live record of traces during the 
execution and the evolution of software artefacts. This way traces recording the sys- 
tem changes are a side-effect of those same changes. There are initiatives to instrument 
existing languages such as ATL with rich log generation [84,31], while others
consider trace recording an aspect that can be woven into existing languages [78,84].
Ziegenhagen et al. mix execution traces with metadata [103], and use developer
interaction records [104] to enrich existing traceability artefacts.

Model transformations are considered the heart and soul of software modeling and,
consequently, numerous studies attempt to enrich trace generation during transforma- 
tion execution [97,83,31]. This ubiquitous integration (see Fig. 5, bottom branch) allows 
a semantically rich tracing of target and source artefacts [71]. Unfortunately, this option 
can only be applied when the system is being built, not when the system is already in 
place. 


Identification rules Once a system is in place, teams can identify rules that help re- 
trieve and maintain traceability relations [64,93]. Nentwich et al. describe a novel se- 
mantics for first-order logic that produces links instead of truth values and give an ac- 
count of their content management strategy that provides rule-based link generation and 
consistency check [66]. At the model level, Grammel et al. use a graph-based model 
matching technique to exploit metamodel matching techniques for the generation of 
trace links for arbitrary source and target models [37], and Saada et al. recover execu- 
tion traces of model transformation using genetic algorithms [83]. 
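
As a deliberately simple illustration of rule-based link generation, the sketch below links a commit to an issue whenever the commit message mentions a known issue identifier. This is not the first-order logic semantics of Nentwich et al. nor any other surveyed technique. The identifier format, the function name, and the sample data are assumptions.

```python
import re

# Assumed identifier format: upper-case project key, dash, number (e.g. REQ-12).
ISSUE_ID = re.compile(r"\b[A-Z]+-\d+\b")


def links_from_commits(commits: dict, known_issues: set) -> set:
    """Apply the rule 'a commit traces to every known issue it mentions'."""
    links = set()
    for commit_id, message in commits.items():
        for issue_id in ISSUE_ID.findall(message):
            if issue_id in known_issues:
                links.add((commit_id, issue_id))
    return links


# Hypothetical commit messages and issue identifiers.
commits = {
    "c1": "Fix login timeout (REQ-12)",
    "c2": "Refactor parser",
    "c3": "Implements REQ-7 and REQ-99",
}
known_issues = {"REQ-7", "REQ-12"}
links = links_from_commits(commits, known_issues)
```

The rule produces links instead of a truth value, and it doubles as a basic consistency check: the mention of REQ-99, which is not a known issue, yields no link.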


Domain contextualization Back in 1992, Borillo et al. published an article on the 
use of information retrieval techniques for linguistics applied to spatial software
engineering [14]. This precursor work paved the way for AI-augmented traceability where
machine learning algorithms help extract knowledge specific to the application domain
(later called domain-contextualized traceability [40]). This is especially useful when the
source (or target) of the trace link is an unstructured document or when such a document
is key to inferring traces among other artefacts.

Today, domain contextualization by means of machine learning for topic modeling, 
word embedding, and more generally knowledge extraction from unorganized text doc- 
uments, is the most popular traceability feature [39,102]. This collective effort made 
the identification of bonds between requirement specifications and other artefacts pos- 
sible with a gradually improving precision [5,23]. Studies on domain contextualization 
are separated into three subgroups according to the type of tools used (algebraic
information retrieval models, statistical language models, and neural networks). For
example, Florez et al. derive fine-grained requirement to source code links [30], Rath et al.
complete missing links between commits and issues [82], and Marcus et al. identify links
between documentation and source code [59]. An interesting publication from Poshy- 
vanyk et al. shows that mixing expertise both in information retrieval techniques and 
engineering domains gives far better results than when taken separately [79]. McMillan 
et al. add that using structural information together with textual information benefits au- 
tomated link recovery (between requirements and source code) [61]. In total, we found 
22 approaches dedicated to this topic alone in our survey. We do not discuss in this paper 
the techniques related to data collection and training optimization. These are important 
features for automated learning which are discussed in depth in specialized literature. 

Teams are also using genetic algorithms to cope with the variety of algorithms and
parameters these approaches use [58,73], and structural information to foster the
interweaving of methodologies [74]. Unfortunately, a common critique has arisen against
these positive results. Too many teams compete with each other to achieve better precision
and recall when there is no standard for effectively quantifying tracing artefacts into
such variables. Too few attempt to qualify the overall relation between these
measurements and the effective impact on software development [22].
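
To make the two measures in this critique concrete, the following minimal sketch computes
precision and recall for a set of recovered trace links against a manually vetted
reference set. The link pairs and the function name are illustrative assumptions, not
data or tooling from any of the surveyed approaches.

```python
def precision_recall(recovered, reference):
    """Return (precision, recall) of recovered trace links w.r.t. a reference set."""
    recovered, reference = set(recovered), set(reference)
    true_positives = len(recovered & reference)
    # Precision: share of recovered links that are correct.
    precision = true_positives / len(recovered) if recovered else 0.0
    # Recall: share of reference links that were recovered.
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical (requirement, source file) links.
reference = {("R1", "auth.py"), ("R2", "db.py"), ("R3", "ui.py"), ("R4", "api.py")}
recovered = {("R1", "auth.py"), ("R2", "db.py"), ("R2", "ui.py")}
p, r = precision_recall(recovered, reference)
print(f"precision={p:.2f} recall={r:.2f}")
```

Neither number says whether the two correct links were the ones a developer actually
needed, which is the gap the critique points at.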

In that regard, Shin et al. propose a set of guidelines for benchmarking automated
traceability techniques. Their evaluation (of 24 approaches) shows that methods of
evaluation (when they are used appropriately) are sometimes not suitable for other
application domains and that the variation in results across projects is not investigated
[91]. This corroborates Borg et al. who, in a systematic literature mapping on information
retrieval approaches to traceability, notice that there is no empirical evidence that any
IR model consistently outperforms another [13]. The ability to continuously improve the
learning process is mentioned in the literature but we found no evidence of its application.


Tool assessment Very few of the traceability approaches have been empirically
assessed on industrial use cases. The current trend of reporting solely precision and
recall values indicates an important issue in the automated identification of traces and
may justify the weak investment of industry in this sector [13,69].

Borg et al. published a taxonomy for information retrieval techniques applied to 
traceability [12]. They emphasize the importance of the assessment of the tooling used 
to derive or identify traces. More specifically, the authors draw a differentiation between 
two orthogonal dimensions: the evaluation context, which specifies the level at which
the tool is assessed (e.g., at a technical, work task, or project level); and the study
environment, which indicates the kind of data used to carry out the assessment (e.g., proprietary,
open source, or academic). These features will affect the measurable attributes used for 
the assessment as well as their generalizability. 


5.4 Trace management 


Fig. 5 shows the hierarchy of features related to the management of trace artefacts: their 
maintenance, integrity, persistence, and integration in running software systems. 


Trace Maintenance Trace links may be affected by changes on the artefacts they link
(directly or transitively) and therefore can easily become obsolete. This gradual decay
must be taken seriously into account to avoid having to re-elicit traces every time they
need to be analyzed. Manual maintenance is not impossible, but it is typically not
feasible in practice due to the amount of information such inspections would involve.
Co-evolution techniques [64,26,80] attempt to ease the burden of keeping trace links
up to date [88,19].

Fig. 5: Tool support for traceability management.

Beyond being able to manipulate traces, we also need to offer proper ways to vi- 
sualize and inspect them [29]. The use of graphical representations stimulates human 
perception and the integration of such techniques in traceability frameworks is a useful
feature to augment user awareness [43]. On the other hand, matrix-based views offer a 
valuable perspective to understand and analyse traces [53]. They are particularly effi- 
cient in assisting the visualization of coverage characteristics of traceability [33,82]. 

In parallel, allowing a rich formulation of queries to assist the exploration of ex- 
isting traces will help with reducing the amount of information users need to navigate 
through [19]. More precisely, structured text, in the form of metamodel instances or 
XML sheets, allows query-based mining of trace datasets [24]. Interaction-wise, hypertext
links are a de facto standard to browse trace links. Indeed, following links through
successive clicks has become almost natural. Querying relies on the type of
representation of traceability artefacts: SQL-like languages benefit from a long history of
information mining while dedicated languages offer better legibility. Genetic programming
has also permitted the automation of query formulation [77].


Trace Integrity To cope with the decay and volatility mentioned above, ways to
determine the integrity of existing traces are greatly needed. Work on these questions,
although loudly called for by literature studies, is scarce in practice [101,4]. The first
option is manual annotation or vetting of trace links to indicate their level
of reliability. Annotations allow a qualitative and quantitative evaluation [18]. This is
the case for back-propagation of verification and validation results between design and
requirements [42]. Some approaches enable the definition of invariant rules that are
checked while manipulating traces or their targets [19]. If an invariant is violated, an
exception for that trace is automatically generated. For example, we could define a rule
that is violated when a change occurs in an artefact targeted by a trace if the
corresponding link was identified more than two versions prior to the current version. In
the same vein, Heisig et al. tag trace links when their target (or source) artefacts are
modified or deleted [43]. Thanks to the ubiquitous integration of the tool, a warning is
then raised in EMF.
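
The example rule given in this paragraph can be written down in a few lines. The sketch
below is only an illustration under stated assumptions: the `TraceLink` record, the
exception type, and the integer version numbers are hypothetical and do not reproduce
the rule language of any cited approach.

```python
from dataclasses import dataclass


class TraceInvariantViolation(Exception):
    """Raised when a trace link no longer satisfies an integrity rule."""


@dataclass
class TraceLink:
    source: str
    target: str
    identified_at_version: int  # version in which the link was identified


def check_on_target_change(link, current_version, max_age=2):
    """Invariant checked when the target of `link` changes.

    The rule is violated if the link was identified more than `max_age`
    versions before the current version.
    """
    age = current_version - link.identified_at_version
    if age > max_age:
        raise TraceInvariantViolation(
            f"{link.source} -> {link.target}: link is {age} versions old"
        )


fresh = TraceLink("REQ-7", "Scheduler.java", identified_at_version=4)
stale = TraceLink("REQ-2", "Parser.java", identified_at_version=1)

check_on_target_change(fresh, current_version=5)  # rule holds, nothing happens
try:
    check_on_target_change(stale, current_version=5)
    violated = False
except TraceInvariantViolation as exc:
    violated = True
    print("exception generated:", exc)
```

A tagging approach would replace the raised exception with a flag on the link, leaving the
decision to re-vet or discard the trace to the engineer.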


Trace persistence Many different storage alternatives exist for traceability artefacts.
One option is to use an SQL-like grammar to store and retrieve traces with the power of
database tooling; another is to use XML documents to represent trace matrices in a
transformable format [57,27]. Industry uses many informal formats, and link
representations often remain implemented in spreadsheets, text files, databases, or
requirement management tools. These links deteriorate quickly during a project as
time-pressured team members fail to update them. Researchers aiming at a reusable approach
favour model-based representations able to express specifically defined concepts related
to traceability (often in a specific domain of application). The burden of keeping traces
coherent is eased in model-based solutions [21].
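
As a rough illustration of the relational storage option, the sketch below keeps trace
links in an in-memory SQLite database and runs two queries over them. The table layout,
link types, and artefact names are assumptions made for this example and are not taken
from the cited work.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE trace_link (
           source    TEXT NOT NULL,
           target    TEXT NOT NULL,
           link_type TEXT NOT NULL,
           PRIMARY KEY (source, target, link_type)
       )"""
)
conn.executemany(
    "INSERT INTO trace_link VALUES (?, ?, ?)",
    [
        ("REQ-1", "Login.java", "implements"),
        ("REQ-1", "LoginTest.java", "verifies"),
        ("REQ-2", "Report.java", "implements"),
    ],
)


def targets_of(requirement):
    """All artefacts linked from the given requirement."""
    rows = conn.execute(
        "SELECT target FROM trace_link WHERE source = ? ORDER BY target",
        (requirement,),
    )
    return [target for (target,) in rows]


def unverified_requirements():
    """Requirements without any 'verifies' link: a simple coverage query."""
    rows = conn.execute(
        """SELECT DISTINCT source FROM trace_link
           WHERE source NOT IN (
               SELECT source FROM trace_link WHERE link_type = 'verifies')
           ORDER BY source"""
    )
    return [source for (source,) in rows]


print(targets_of("REQ-1"))
print(unverified_requirements())
```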


Another concern lies in the recording of trace evolution. The creation of a trace should
be recorded, together with the successive changes that affect it, for evolution analysis.
Integrity measures associated with evolution events (e.g., creation, modification) should
be recorded as well, so that their evolution over a period of time can be evaluated.
Rahimi et al. ensure the co-evolution of artefacts and traces [80] using a set of
heuristics coupled with refactoring detection and information retrieval techniques to
detect change scenarios between contiguous versions of software systems.


System integration Like most MDE approaches, Helming et al. use the same
modeling language for both traceability and system artefacts [44]. Tracing features are
embedded in the language. The joint use of EMF and a dedicated traceability metamodel
(both written in Ecore) facilitates the integration of traceability features, including
graphical versions to stimulate human perception and standard analysis of traces
in the native environment. Galvao et al., in their seminal work on traceability and MDE,
call for more loosely coupled traceability support that can integrate external
relationships with independent representations (in another, ideally common, language) [32],
as also elaborated by Azevedo et al. [6]. Finally, the SysMLv2 implementation committee
is calling for an orthogonal implementation of features such as traceability, annotations,
and comments through meta-level libraries in order to keep concerns separated at design
level.
6 Discussion


The feature model is a first step towards the shared understanding of all dimensions 
involved in a traceability solution. Ideally, a company interested in a certain set of such 
dimensions could try to create its perfect traceability solution by combining the top 
solutions for each dimension. But this is not yet a real possibility as those solutions would
be difficult to combine and, more importantly, several of the features in the feature 
model do not really have a great solution yet. This section elaborates on this discussion 
by presenting some open challenges in software traceability research. 

Common traceability metamodel. We have counted over 20 different traceability 
metamodel proposals. Nevertheless, some are solutions limited to the specific problems 
the authors present as case studies. And these metamodels are rarely reused, if ever. This
proliferation makes it challenging for different traceability solutions to interoperate.
The research community should agree on a unified proposal that facilitates the
composability of traceability solutions.

Security of trace data. Considering that traceability is a major aspect in certifica- 
tion and other critical applications, it is surprising to see so little interest in security 
concerns in relationship to trace artefacts. We believe security mechanisms (even sim- 
ple rule-based access control) for traceability are needed to control who can modify 
what trace data, given the implication such changes can have. 

Library of trace types and semantics. We already mentioned the importance of 
having a rich set of types for traces to let engineers express the reasons behind the 
creation of a given trace. But at the same time, complete freedom makes reusability of 
analysis techniques difficult. We would like to see a rich yet predefined set of types for 
traces that could then be imported in new traceability projects. 

Usefulness of identified traces. Managing a large number of traces is time-consuming. 
As such, we should make sure every explicit trace is actually useful. So far, algorithms 
aimed at automatically identifying traces are compared based on standard properties 
like precision and recall. But they should be evaluated on “usefulness”: are those traces
useful for the end-user, or are they simply redundant noise?

Verification, validation and testing of traces. Our ample literature on verification, 
validation and testing methods for software engineering should be extended to deal with 
trace data, especially from a temporal perspective. Reasoning on outdated and poten- 
tially incorrect trace data could have strong damaging impacts on the system as a whole. 
So far, very few approaches target these aspects except in coevolution in model-driven
engineering. A recent study shows that the ability to justify the quality and integrity of
traces with evidence and uncertainty evaluation is a prerequisite to robust and reliable
traceability [8]. Given the effort required to create traces in the first place, it is important
to instill more confidence in practitioners unsure whether creating traces is worthwhile.

Traceability as core concern in general languages. Another important step to- 
wards the mainstream adoption of traceability in industry is the integration of the com- 
mon traceability metamodel in popular modeling languages like UML or SysML, in the 
form of a profile (to be able to directly reuse existing modeling tools available for those 
languages) or new packages in the respective standards. This way, traceability would
become a core concern and a first-class modeling primitive in software development
while still being a rich concept and not just a variation of the simple generic plain
dependency relationship we can use right now in those languages.

Working together with the industry. Orthogonal to all the others, we (the re- 
search community) should aim at more frequent exchanges with practitioners to better 
understand why they still create traces manually instead of reusing any of the dozens of 
existing solutions. Some reasons have already been hinted at in this paper, but there might
be others we are not aware of. If we want traceability research to transfer to industry, 
more and better communication flows should be part of the agenda. 


7 Conclusion 


Our survey reveals a continuous interest in traceability even if, often, it does not have
the spotlight it deserves given the key role it plays in a good deal of software
engineering tasks⁴. Work relating to traceability is indeed disseminated within established
research communities (e.g., debugging, SPL). Existing conceptualizations vary greatly
depending on the community to which their authors belong as well as the objectives
they aim at. As a consequence, a clear and measurable idea of the costs and benefits
of software traceability is slow to emerge. To help visualize, classify and compare the
different traceability approaches, we propose a feature model covering all important
traceability aspects, as derived from a thorough analysis of the traceability literature.
Following the existing body of work, we put special emphasis on separating how traces
are represented from how they are identified and managed.

Beyond the feature model, our analysis highlights several limitations of current 
traceability approaches that should be further developed. We believe advancing on those 
aspects is especially important, even more given the new traceability challenges posed 
by the growing use of AI in Software Engineering (e.g. in terms of reproducibility and 
explainability of the AI decisions) [90,99]. In this sense, we hope this paper serves as 
a “wake-up call” to make sure new AI for SE proposals come together with a proper 
traceability mechanism that assists engineers in evaluating and understanding the im- 
pact of the new AI components in the software engineering process instead of having 
to blindly trust them. 

As further work, we plan to start working on the above-mentioned aspects starting 
with a collaboration with some of the authors of other proposals to map and bridge their 
algorithms and techniques to our modular and quality-focused metamodel in order to 
combine the benefits of a unified and generic approach with those of a more
domain-specific representation. We will also study how to better embed traceability concepts
into mainstream modeling languages (like UML or SysML) to further facilitate its adoption.


Acknowledgements: This work has been partially funded by the Spanish gov- 
ernment (LOCOSS project - PID2020-114615RB-I00), and receives support from the 
ECSEL Joint Undertaking (AIDOaRt - grant agreement No 101007350). 


4 As an example, ICSE’18 awarded a trace-based paper as the most influential paper in the
past 10 years [50]. The work introduced a novel trace-based approach to debugging. Though
the focus was on the debugging aspect of the paper, traceability was the key to achieving
that debugging improvement. The word "trace" alone is mentioned 46 times in the 10-page paper.
Abstract. Software verifiers have different strengths and weaknesses,
depending on properties of the verification task. It is well-known that 
combinations of verifiers via portfolio and selection approaches can help 
to combine the strengths. In this paper, we investigate (a) how to easily 
compose such combinations from existing, ‘off-the-shelf’ verification tools 
without changing them and (b) how much performance improvement easy 
combinations can yield, regarding the effectiveness (number of solved 
problems) and efficiency (consumed resources). First, we contribute a 
method to systematically and conveniently construct verifier combinations 
from existing tools, using the composition framework CoVeriTeam. We
consider sequential portfolios, parallel portfolios, and algorithm selections. 
Second, we perform a large experiment on 8883 verification tasks to 
show that combinations can improve the verification results without 
additional computational resources. All combinations are constructed 
from off-the-shelf verifiers, that is, we use them as published. The result of 
our work suggests that users of verification tools can achieve a significant 
improvement at a negligible cost (only configure our composition scripts). 


Keywords: Software verification · Program analysis · Cooperative verification ·
Tool Combinations · Portfolio · Algorithm Selection · CoVeriTeam


1 Introduction 


Automatic software verification has been an active area of research for many 
decades and various tools and techniques have been developed to solve the problem 
of verifying software [3, 7, 9, 25, 34, 37]. The research has also been adopted in
practice [2, 22, 24, 39]. Each tool and technique has its own strengths in specific
areas. In such a scenario, it is natural to combine these tools to benefit
from the strengths of individual tools, leading to a ‘meta verifier’ that solves 
more problems. Most current combination approaches are hardcoded, that is, the 
choice of the tools and the way to combine them is specifically programmed. 
We contribute a method to construct combinations in a systematic way, 
independently from the set of tools to use. As for the types of combinations, 
we considered sequential and parallel portfolio [36], and algorithm selection [47]. 
The combinations are composed and executed with the tool CoVeriTeam [15].
CoVeriTeam is a tool that is based on off-the-shelf atomic actors, which are
executable units based on tool archives. It provides a simple language to construct
tool combinations, and manages the download and execution of the existing tools
on the provided input. CoVeriTeam provides a library of atomic actors for many
well-known and publicly available verification tools. A new verification tool can
be easily integrated into CoVeriTeam within a few minutes of effort.

For our experimental evaluation, we selected eight of the verification tools 
that participated in the 10th competition on software verification [6]. We reused 
the archives submitted to this competition, and composed combinations of three 
types (sequential and parallel portfolio, algorithm selection) with 2, 3, 4, and 8 
verification tools: in total 12 combinations. We evaluated these 12 combinations on 
a large benchmark set consisting of 8883 verification tasks in total and compared
the results of the combinations against the results of the existing tools. 

We show that all three combination approaches can lead to considerable 
improvements of the performance regarding effectiveness (number of correctly 
solved instances) and efficiency (consumed resources). 


Contributions. We make the following contributions: 


1. We show how to conveniently construct combination approaches from off-the- 
shelf verification tools in a modular manner, without changing the tools. 

2. We perform an extensive comparative evaluation of sequential portfolio, 
parallel portfolio, and algorithm selection approaches. 

3. We provide a reproduction package containing the tools and experiment data.


2 Improving Verification by Verifier Combinations 


In this study, we explore different strategies for combining verifiers to improve 
the overall verification effectiveness. We focus on the most commonly applied 
black-box combinations (i.e., combinations that require neither changes to
the existing tools nor communication between verification tools), which we briefly
describe in the following.


Verifier Combinations. Existing strategies for combining verifiers can be 
generally classified into one of the following three categories: sequential portfolios 
[17, 33, 53], parallel portfolios [35, 36, 40], and algorithm selectors [8, 28, 47, 48, 50].
We provide an overview of these composition strategies in Figs. 1 and 2.


Sequential Portfolio. Portfolios combine several verification algorithms by 
executing them either sequentially or in parallel. A sequential portfolio (Fig. 1) 
executes a set of verifiers sequentially by running one verifier after another. In 
this setting, each verifier is assigned a specific time limit and the verifier runs 
until it finds a solution or reaches the time limit. If the current verifier is able 
to solve the given verification task, the sequential composition is stopped and 
the solution is emitted. Otherwise, if a verifier runs into a timeout without finding
a solution, the current algorithm is stopped and the next one is started. CPA-Seq [17, 53] and
Ultimate Automizer [33] are examples of sequential portfolios. 
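
The scheduling logic of a sequential portfolio can be sketched in a few lines. In the
sketch below, verifiers are simulated by hypothetical stand-in functions that succeed
only if their time limit is large enough; the verdict strings, time limits, and helper
names are assumptions for illustration and do not correspond to any real tool.

```python
def sequential_portfolio(schedule, task):
    """Run (verifier, time_limit) pairs in order; return the first conclusive verdict."""
    for verifier, time_limit in schedule:
        verdict = verifier(task, time_limit)
        if verdict in ("true", "false"):
            return verdict  # solved: stop the sequential composition
        # otherwise the verifier gave up or timed out; try the next one
    return "unknown"


def make_verifier(seconds_needed, verdict):
    """Hypothetical stand-in: solves the task only if given enough time."""
    def verifier(task, time_limit):
        return verdict if seconds_needed <= time_limit else "unknown"
    return verifier


slow = make_verifier(seconds_needed=500, verdict="true")
fast = make_verifier(seconds_needed=20, verdict="true")

# The first verifier exhausts its 300 s budget, so the second one is started.
result = sequential_portfolio([(slow, 300), (fast, 300)], task="example.c")
print(result)
```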
[Figure: verifiers (labels Verifier 2, Verifier 3) scheduled one after another along a
time axis, leading to a result.]

Fig. 1: Sequential portfolio of verifiers. Each verifier runs for a certain amount of
time. If a verifier stops without computing a result (grey box), the next one is
started (white box with double borders).


Parallel Portfolio. In contrast to sequential portfolios, a parallel portfolio 
(Fig. 2(a)) executes all verification algorithms in parallel, while sharing all system 
resources like CPU time and memory. As soon as one algorithm solves the given 
verification problem, the portfolio is stopped. Based on the assumption that 
all verifiers provide only sound solutions, we can safely take the first solution 
computed as the final result of the overall portfolio. PredatorHP [35, 40] is an
example of a parallel portfolio. 


Algorithm Selection. To reduce spending resources on unsuccessful verifiers, 
algorithm selectors (Fig. 2(b)) are designed to select the verification algorithm 
that is likely well suited to solve a given verification task. More precisely, the algo- 
rithm selector analyzes the given verification problem for common characteristics 
(typically program features like the existence of a loop or an array) and based on 
these features, selects a verification algorithm likely suited for the given problem. 
Then the selected verifier is executed. Algorithm selectors were recently explored 
for selecting a task-dependent verification algorithm (e.g., in PeSCo [48,50]) or a 
complete verification strategy (e.g., in CPAchecker [8]). 

The above combination types have their own advantages and limitations when
applied in real-world scenarios. While algorithm selectors avoid the need to
share resources, the approach relies heavily on the selection algorithm used. If
the selection algorithm is not powerful enough or the selection task is too difficult,
the selector fails to identify a verifier equipped for the given task. Although
portfolios avoid this problem by assigning the verification task to several verifiers,
each verifier gets fewer resources, which could lead to out-of-resource failures.


3 Construction of Verifier Combinations with CoVeriTeam


CoVeriTeam [15] is a tool for creating and executing tool combinations for
cooperative verification [20]. It consists of a language for tool composition, and
an execution engine for this language. Tools are considered as verification actors
(verifiers, validators, testers, transformers), and the inputs consumed and outputs
produced by the tools as verification artifacts (programs, specifications, witnesses,
results). Verification artifacts are seen as basic objects, verification actors as
basic operations, and tool combinations as compositions of these operations.
CoVeriTeam supports execution of most of the well-known automated verification
tools that are publicly available. The composition operators supported by
CoVeriTeam are: SEQUENCE, PARALLEL, REPEAT, and ITE. SEQUENCE executes
the composed tools sequentially, PARALLEL in parallel, REPEAT repeatedly
until a termination condition is satisfied, and ITE is an if-then-else that executes one
tool if the provided condition is true and otherwise the other. The work in this
paper uses SEQUENCE, PARALLEL, ITE, and a newly developed PORTFOLIO.

Fig. 2: Comparison of parallel portfolio and algorithm selection


3.1 Verifier Based on Sequential Portfolio 


Fig. 3: Verifier based on sequential portfolio 


Figure 3 shows the construction of a sequential portfolio of two verifiers
verifier1 and verifier2 using CoVeriTeam. This construction uses two kinds of
compositions: SEQUENCE and ITE. At the outermost level, it is a sequence
of verifier1 and an actor that is itself a composition, namely an ITE composition.
Let us call it ite_verifier. When we execute this composition, first, verifier1 is
executed and then ite_verifier. ite_verifier first checks if verifier1 was successful
in verification or not (i.e., verdict ∈ {T, F}). If verifier1 was successful, then it
forwards the results, otherwise, verifier2 is executed and its results are taken.
This construction can be generalized to create sequential portfolios of arbitrary
sizes. We used it to create sequential portfolios of 2, 3, 4, and 8 verifiers.
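
The nesting of SEQUENCE and ITE described above, and its generalization to more than two
verifiers, can be mimicked with ordinary higher-order functions. The following is a
plain Python sketch of the idea, not CoVeriTeam's composition language: actors are
modelled as functions from a dictionary of artifacts to a dictionary of produced
artifacts, and all names (`sequence`, `ite`, `identity`, `make_verifier`) are
assumptions introduced for this illustration.

```python
def sequence(first, second):
    """SEQUENCE: run `first`, then `second` on the artifacts accumulated so far."""
    def actor(artifacts):
        after_first = {**artifacts, **first(artifacts)}
        return {**after_first, **second(after_first)}
    return actor


def ite(condition, then_actor, else_actor):
    """ITE: run `then_actor` if the condition holds on the artifacts, else `else_actor`."""
    def actor(artifacts):
        chosen = then_actor if condition(artifacts) else else_actor
        return chosen(artifacts)
    return actor


def identity(artifacts):
    """Forward the results unchanged."""
    return artifacts


def is_conclusive(artifacts):
    return artifacts.get("verdict") in ("T", "F")


def sequential_portfolio(verifiers):
    """Nest SEQUENCE(verifier_i, ITE(conclusive, forward, rest)) from the inside out."""
    *init, last = verifiers
    portfolio = last
    for verifier in reversed(init):
        portfolio = sequence(verifier, ite(is_conclusive, identity, portfolio))
    return portfolio


calls = []


def make_verifier(name, verdict):
    """Hypothetical verifier stub that records its execution."""
    def verifier(artifacts):
        calls.append(name)
        return {"verdict": verdict}
    return verifier


portfolio = sequential_portfolio([
    make_verifier("verifier1", "unknown"),
    make_verifier("verifier2", "F"),
    make_verifier("verifier3", "T"),
])
result = portfolio({"program": "example.c", "spec": "unreach-call"})
print(result["verdict"], calls)
```

In this run the third verifier is never executed, because the ITE guarding it sees a
conclusive verdict from the second one.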


3.2 Verifier Based on Parallel Portfolios 


We developed a composition operator for parallel portfolio in CoVeriTeam. In
this composition, multiple tools are executed in parallel and the result of the
one that succeeds first is taken. The composition consists of a set of verification
actors of the same type (verifiers, testers, etc.), and a success condition defined
over the artifacts produced by these actors. When one actor finishes, the success
condition is evaluated: if it holds then the output of this actor is taken and the
execution of the remaining actors is stopped. Otherwise, the portfolio waits for
the next actor to finish and repeats the check. If none of the actors produces an
output that satisfies the success condition, the result of the last one is taken.


Figure 4 shows a parallel portfolio of two verifiers verifier1 and verifier2. In this
case, both the verifiers are executed simultaneously. When one verifier finishes, its
result is checked for the success condition (i.e., verdict ∈ {T, F}). If the success
condition holds then the result is forwarded, otherwise, the result is discarded 
and we wait for the second verifier to finish. Once a successful result is available, 
the remaining executing verifiers are terminated. For our experiments, we created 
parallel portfolios of 2, 3, 4, and 8 tools. 
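
The control flow of the parallel portfolio (take the first result that satisfies the
success condition, otherwise fall back to the last result) can be sketched with Python's
standard `concurrent.futures` module. This is an illustrative approximation under stated
assumptions: verifiers are simulated by sleeping stub functions, and because Python
threads cannot be killed, the sketch merely stops waiting for the remaining actors
where a real execution engine would terminate the tool processes.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def parallel_portfolio(actors, task, success):
    """Run all actors at once; return the first result satisfying `success`.

    If no result satisfies the condition, the result of the actor that
    finished last is returned.
    """
    executor = ThreadPoolExecutor(max_workers=len(actors))
    futures = [executor.submit(actor, task) for actor in actors]
    result = None
    for future in as_completed(futures):
        result = future.result()
        if success(result):
            break  # a successful result is available
    # Stop waiting for the others (a real engine would terminate them).
    executor.shutdown(wait=False)
    return result


def make_verifier(delay, verdict):
    """Hypothetical verifier stub that 'runs' for `delay` seconds."""
    def verifier(task):
        time.sleep(delay)
        return {"verdict": verdict}
    return verifier


def is_conclusive(result):
    return result["verdict"] in ("T", "F")


verifiers = [
    make_verifier(0.3, "F"),         # conclusive but slow
    make_verifier(0.01, "unknown"),  # finishes first, result is discarded
    make_verifier(0.1, "T"),         # first conclusive result
]
result = parallel_portfolio(verifiers, "example.c", is_conclusive)
print(result)
```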


3.3 Verifier Based on Algorithm Selection 


We designed and implemented a generic selection framework in CoVeriTeam for
selecting verifiers. The framework decomposes the algorithm-selection process into
two phases: (1) a feature-extraction phase, in which a feature encoder extracts a
set of predefined features for a given verification task (i.e., certain characteristics
that are believed to indicate difficulty for a verifier), and (2) selection to identify
an appropriate verifier based on the extracted features. Each phase is constructed
using CoVeriTeam actors (explained below in more detail). Figure 5 shows the
CoVeriTeam composition of a verifier based on algorithm selection.
Fig. 5: Verifier based on algorithm selection


Feature Encoder. The first component of our framework is the feature encoder. 
Given a verification task consisting of a program P and a specification S, the goal 
of the feature encoder is to encode the problem into a meaningful feature-vector 
(FV) representation, which we can later use to select a verification tool. Typically,
the representation encodes certain features of a program which might correlate
with the performance of a verifier such as the occurrence of specific loop
patterns [28] or variable types [29]. In this study, we encode verification problems via
a learning-based feature encoder by employing a pretrained CSTTransformer [50].
The CSTTransformer first parses a given program P into a simplified abstract 
syntax tree (AST) representation. Afterwards, a specific type of neural network 
processes the AST structure to produce a vector representation. The last en- 
coding step is learned by pretraining the neural network on selecting various 
verification tools. While this approach was originally developed to learn a vector 
representation optimized for a specific verifier composition, the authors showed 
that the learned encoder can be effectively reused across many new selection 
tasks, often outperforming other hand-crafted feature encoders. 


Selection of Verifiers Based on the Individual Difficulty of the Tasks. 
The same task might be solved with one tool in a few seconds, while another is 
not able to find a solution within the given resource constraints. Therefore, to 
avoid wasting resources on tools that are not well suited for a given task, the 
algorithm selector aims to predict the difficulty of a task before executing a tool. 
Then, the tool that is predicted to be the best suited tool for the task is executed. 

Similar to previous work [28, 50], we learn to predict the difficulty of a task with
hardness models [55]. Based on the previously computed vector representation, a
hardness model learns to predict the hardness of a given task for a specific tool. 
In our case, this reduces to a binary classification problem of predicting whether 
a tool can solve a task or not. We address this by training logistic regression 
classifiers. The classifier’s confidence that a verifier will fail a particular task then 
determines the hardness of the task. 

Now, given a set of hardness models (each assessing the hardness of a
verification task for a specific tool), a verification tool is selected for which the
task is likely easy (i.e., the respective model outputs the lowest hardness score).
The final selection is done by a comparator implemented in CoVeriTeam that
selects a tool by comparing the hardness scores.
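
The selection step can be illustrated with a small sketch. Each hardness model is
written here as a logistic model that maps a feature vector to the predicted probability
that the tool fails, and the comparator picks the tool with the lowest score. The
weights, the two-dimensional feature vector, and the tool names are made-up values for
illustration; they are not the trained models or the learned encoder output.

```python
import math


def hardness(weights, bias, features):
    """Predicted probability that the tool fails on the task (logistic model)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def select_tool(models, features):
    """Comparator: return the tool with the lowest hardness score, plus all scores."""
    scores = {tool: hardness(w, b, features) for tool, (w, b) in models.items()}
    return min(scores, key=scores.get), scores


# One hypothetical hardness model (weights, bias) per tool.
models = {
    "tool_a": ([2.0, -1.0], 0.0),
    "tool_b": ([-1.5, 0.5], 0.2),
}
features = [1.0, 0.0]  # stand-in for the feature vector of a verification task

selected, scores = select_tool(models, features)
print(selected, {tool: round(s, 3) for tool, s in scores.items()})
```

Because every tool has its own independent model, changing the set of candidate tools
amounts to adding or removing an entry of `models`.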


3.4 Extensibility 


To facilitate future research and the design of novel combinations, we implemented 
all combination types such that they can be easily configured and extended.
Extending a combination with a new verifier only requires an actor definition for
that verifier in CoVeriTeam. Afterwards, this actor can be put in a sequential or
parallel portfolio by adding it to the composition. While our algorithm selector 
can be easily used with all tools employed during our experiments, extending 
a combination based on algorithm selection with a new verifier requires a bit 
more effort. However, by using hardness models together with a common feature 
representation we simplified the process required for configuring algorithm
selection. In fact, we are able to modify the set of verifiers to select from by simply
adding or removing individual hardness models. While previous approaches to
verifier selection often require training the complete selector from scratch, our
combination can be extended by training a single hardness model.² For training a
new model, we provide all training scripts that were used for training our hardness
models and a precomputed dataset of vector representations for SV-COMP 2021.
Therefore, to integrate a new tool in our algorithm selector, one only needs to
run the respective verifier once on (a subset of) the benchmark set. The results
then act as training examples.

Fig. 6: Subsets of verification tools used for composition


4 Evaluation 


We perform a thorough experimental evaluation on a large benchmark set in order 
to show the potential of combinations. We address the following research questions 
concerning the comparative evaluation of combinations against standalone tools: 


RQ 1. Can a CoVeriTeam-based sequential portfolio of verifiers perform
significantly better than standalone tools with respect to
(a) number of solved verification tasks, and
(b) resource consumption?


RQ 2. Can a CoVeriTeam-based parallel portfolio of verifiers perform
significantly better than standalone tools with respect to
(a) number of solved verification tasks, and
(b) resource consumption?


RQ 3. Can a CoVeriTeam-based algorithm selection of verifiers perform
significantly better than standalone tools with respect to
(a) number of solved verification tasks, and
(b) resource consumption?


4.1 Experimental Setup 


Selection of Existing Verifiers. We selected eight existing verification tools
that performed well in a recent competition on software verification (SV-COMP
2021) [6]. We excluded two verifiers from consideration: VeriAbs [27] and
PeSCo [49]. VeriAbs was excluded because its license does not allow us to
use it for scientific evaluation, and PeSCo because it is a derivative of CPAchecker
that would not contribute to diversity of technology in the combinations. The
chosen set of verifiers used for the tool combinations is depicted in Fig. 6.


2 A single hardness model can be trained within a few minutes on a modern CPU.


Tool Combinations. We evaluated twelve verifier combinations: for each of
sequential portfolio, parallel portfolio, and algorithm selection, we constructed
a combination of 2, 3, 4, and 8 verifiers. These variants of combinations with
different numbers of verifiers allowed us to quantify the influence of the number
of verifiers on the performance. We constructed these subsets of verifiers to
maximize the number of tasks (from our benchmark set) that can be solved by
at least one tool in the subset. For sequential portfolios, we additionally ranked
the verifiers in descending order of their success on the benchmark. We used
the results from SV-COMP 2021 to achieve this. Figure 6 illustrates the sets of
verifiers that we composed in different types of combinations.
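One way to construct such subsets is sketched below. The sketch assumes an exhaustive search over all subsets of size k (cheap for eight tools) and uses the number of individually solved tasks as the measure of success for ranking; neither choice is stated in the text, and the tool names and task sets are hypothetical.

```python
from itertools import combinations


def best_subset(solved, k):
    """Pick the k tools whose union of solved tasks is largest (exhaustive)."""
    return max(
        combinations(sorted(solved), k),
        key=lambda subset: len(set().union(*(solved[t] for t in subset))),
    )


def rank_for_sequence(subset, solved):
    """Order tools by descending number of individually solved tasks."""
    return sorted(subset, key=lambda t: len(solved[t]), reverse=True)


# Hypothetical data: tool name -> set of task ids solved by that tool.
SOLVED = {
    "tool_a": {1, 2, 3, 4},
    "tool_b": {1, 2, 3},
    "tool_c": {5, 6},
    "tool_d": {4, 5},
}

if __name__ == "__main__":
    subset = best_subset(SOLVED, 2)
    print(subset, rank_for_sequence(subset, SOLVED))
```

In this toy data, the two individually strongest tools (`tool_a` and `tool_b`) are not the best pair, because they solve almost the same tasks; the pair that covers the most tasks combines complementary tools.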


Execution Environment. Our experiments were executed on machines with
the following configuration: one 3.4 GHz CPU (Intel Xeon E3-1230 v5) with
8 processing units (virtual cores), 33 GB RAM, operating system Ubuntu 20.04.
Each verification run (execution of one tool or combination on one verification
task) was limited to 8 processing units, 15 min of CPU time, and 15 GB memory.
This configuration is the same as the configuration used in SV-COMP 2021,
allowing us to use the competition results of the standalone tools for comparison.


Benchmark Selection. Our benchmark set consists of all the verification tasks
with specification unreach-call from the open-source collection of verification
tasks SV-Benchmarks³. Each verification task consists of a program written in C
and a specification. The specification is a safety property describing that an error
location should never be reached. The benchmark set includes all verification
tasks of the competition categories ReachSafety and Concurrency, and a part
of the verification tasks in category SoftwareSystems. In total, there were 8 883
verification tasks in our benchmark set. We evaluated our combinations on the
version of the benchmark set that was used in SV-COMP 2021 (tag svcomp21).


Scoring Schema. We not only count the number of results of each kind⁴ for the
verification tasks, but also the scores as used in the competition, because this
models what the community considers as quality. A verifier is awarded score
points as follows: 2 score points for each correct proof, 1 score point for each
correct alarm, -32 score points for each wrong proof, and -16 score points for each
wrong alarm. This schema has been used in SV-COMP [6] for several years and has
been accepted by the verification community for judging the quality of results.
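The scoring schema can be written down directly. The following sketch (the function name is our own choice) computes the score from the four result counts; applied to the counts reported for CPAchecker in Table 1 (3516 correct proofs, 2136 correct alarms, 0 wrong proofs, 8 wrong alarms), it yields the reported score of 9040.

```python
def competition_score(correct_proofs, correct_alarms, wrong_proofs, wrong_alarms):
    """Score points: +2 correct proof, +1 correct alarm, -32 wrong proof, -16 wrong alarm."""
    return (
        2 * correct_proofs
        + 1 * correct_alarms
        - 32 * wrong_proofs
        - 16 * wrong_alarms
    )


if __name__ == "__main__":
    # Counts for CPAchecker as listed in Table 1.
    print(competition_score(3516, 2136, 0, 8))
```

The asymmetric penalties explain why a combination with more correct results can still end up with a lower score: a single wrong proof cancels sixteen correct proofs.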


Resource Measurement and Benchmark Execution. We used the state-of-the-art
benchmarking framework BenchExec [18] for executing our benchmarks.
It executes tools in isolation, reports the resource consumption, and also enforces
the resource limitations. It provides measurements of the consumption of CPU
time, wall time, memory, and CPU energy during an execution of a tool.


4.2 Results of Existing Verifiers as Standalone 


Table 1 shows the summary of results of the execution of the standalone tools on
the selected benchmark set. These results are publicly available in the respective
reproduction package of the competition [5] and on the competition web site⁵.
We only adjust the presentation to our needs here.


3 https://gitlab.com/sosy-lab/benchmarking/sv-benchmarks
4 Either claims of program correctness or alarms of specification violations.


Table 1: Standalone verifiers
[Column 1 is CPAchecker (cf. Table 2); the names of the other seven verifiers are
illegible in the column headers, see Fig. 6 for the set of tools.]

Verifier           CPAchecker    (2)    (3)    (4)    (5)    (6)    (7)    (8)
Score                    9040   6623   4878   7146   4663   3679   2770   5338
Correct results          5652   4481   3001   4358   3484   2922   1385   3725
Correct proofs           3516   2958   1909   2836   1499   1605   1385   2365
Correct alarms           2136   1523   1092   1522   1985   1317      0   1360
Wrong results               8     29      2      2     19     41      0     24
Wrong proofs                0     22      0      1      1     12      0     23
Wrong alarms                8      7      2      1     18     29      0      1
Total resource consumption for correct results
CPU time (h)              190     57     22     97     31     60     11     81
Wall time (h)             140     57     22     59     31     15     11     52
Memory (GB)              7000   1800    770   4300   1300   2000    120   2700
CPU Energy (kJ)          7700   2500   1000   3500   1300   1500    560   3000
Median resource consumption for correct results
CPU time (s)               61   0.84   0.81     36   0.70     17   0.78     39
Wall time (s)              32   0.84   0.84     12   0.69    9.1   0.80     13
Memory (MB)               600     53     25    450     44    670     25    430
CPU Energy (J)            590     11     11    310    9.2    150     11    330
Resource consumption of correct results per score point
CPU time (s/sp)            77     31     16     49     24     59     15     55
Wall time (s/sp)           55     31     16     30     24     14     15     35
Memory (MB/sp)            780    270    160    600    280    540     42    500
CPU Energy (J/sp)         850    380    210    490    280    420    200    560

Figure 7 shows the quantile plots of the results, where the x-coordinate
represents the quantile of score obtained by the tool below the run time represented
by the y-coordinate. We used a logarithmic scale for time ranges between 1 and 1000
seconds, and a linear scale between 0 and 1 second. The graph of a tool that solves
more verification tasks extends farther to the right, and the graph of a faster tool
is lower. The farther to the right a plot goes and the lower it
remains, the better it is. More details about these plots are given elsewhere [4].
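The data points of such a score-based quantile plot can be computed as in the following sketch. It assumes that the graph starts at the accumulated (negative) score of all wrong results and that each correct result, taken in order of increasing run time, adds its score points; the input data are hypothetical.

```python
def quantile_points(results):
    """Compute (cumulative score, run time) points of a score-based quantile plot.

    results: list of (score_points, run_time_s) pairs, one per verification run;
    negative score points denote wrong results, positive ones correct results.
    """
    # The graph starts at the sum of all penalties for wrong results.
    total = sum(score for score, _ in results if score < 0)
    # Correct results are added from fastest to slowest.
    correct = sorted((time, score) for score, time in results if score > 0)
    points = []
    for time, score in correct:
        total += score
        points.append((total, time))
    return points


if __name__ == "__main__":
    runs = [(2, 10.0), (1, 0.5), (-16, 3.0), (2, 100.0)]  # hypothetical runs
    print(quantile_points(runs))
```

A tool with many wrong results therefore starts far to the left, and a tool with many correct results ends far to the right.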

Figure 8 shows the resource consumption for standalone tools using a
parallel-coordinates plot (each parallel coordinate represents a different variable). The
plot shows the number of unsolved tasks and the resource consumption per score
point. The lower the plot of a tool is, the better it is for the user.


4.3 RQ 1: Evaluation of Sequential-Portfolio Verifier 


We now present the results of the sequential-portfolio verifier against the existing
standalone verifier with the highest score: CPAchecker.


5 https://sv-comp.sosy-lab.org/2021/results/results-verified


Fig. 7: Standalone verifiers: Score-based quantile plot for results

Fig. 8: Standalone verifiers: Parallel-coordinates plot showing unsolved tasks and
resource consumption per score point


Table 2 shows the summary of results for the sequential portfolios. The sequential
portfolio, in general, achieves a better score than the best-performing standalone
tool. The portfolio with 8 tools performs worst, which is expected because, as we
increase the size of the portfolio, the amount of time allocated to each verifier
decreases. This means that the verifiers can only solve relatively easy tasks.
The table also shows that the portfolio requires more resources to solve the tasks.
This is a side effect of the sequential portfolio, as all the resources consumed
by unsuccessful attempts to solve a given task by the verifiers in a sequence are
still counted in the resource consumption. Also, the portfolio with 8 tools has a
considerably larger number of wrong results, as it tends to return the results of
fast verifiers instead of those of the verifiers earlier in the sequence. The index at
which a verifier is placed plays a key role in the performance of the sequential
portfolio. If we put a verifier that produces results fast but has more wrong results
first in the sequential portfolio, then the overall results will contain many wrong results.
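The effect of the portfolio size on the time available per verifier, and the accounting of unsuccessful attempts, can be illustrated with a small simulation. It assumes that the CPU-time limit is divided evenly among the verifiers of the portfolio (the split actually used is not stated in this section); the tool names and run times are hypothetical.

```python
def run_sequential(portfolio, needed_time, time_limit=900.0):
    """Simulate a sequential portfolio on one verification task.

    portfolio: ordered list of tool names.
    needed_time: tool -> seconds the tool needs for the task (missing: no result).
    Returns (tool that produced the result or None, CPU time consumed in total).
    """
    time_slice = time_limit / len(portfolio)  # assumption: equal time slices
    consumed = 0.0
    for tool in portfolio:
        need = needed_time.get(tool)
        if need is not None and need <= time_slice:
            return tool, consumed + need
        consumed += time_slice  # a failed attempt still costs its full slice
    return None, consumed


if __name__ == "__main__":
    needs = {"tool_a": 600.0, "tool_b": 100.0}  # hypothetical run times
    print(run_sequential(["tool_a", "tool_b"], needs))
    print(run_sequential(["tool_a", "tool_b", "tool_c", "tool_d"], {"tool_a": 600.0, "tool_b": 300.0}))
```

In the first call, the task is solved by the second tool, but the consumed time includes the 450 s that the first tool spent without success. In the second call, the slice of 225 s is too short for any tool, so the task remains unsolved although it would be solvable with a smaller portfolio.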

Figure 9 shows the quantile plot of scores. As a portfolio is biased towards
the verifiers that compute results fast and not towards correctness, the graphs of
the sequential portfolio combinations start farthest to the left, i.e., they have the
most negative score, or the most wrong results. CPAchecker has the fewest
wrong results, and therefore its starting point is farthest to the right.


Table 2: Sequential portfolios of different sizes with CPAchecker

                                       Sequential Portfolio of
Verifier           CPAchecker       2       3       4       8
Score                    9040    9198    9519    9522    8349
Correct results          5652    6058    6239    6275    6084
Correct proofs           3516    3780    3920    3903    3721
Correct alarms           2136    2278    2319    2372    2363
Wrong results               8      26      26      27      61
Wrong proofs                0      14      14      14      30
Wrong alarms                8      12      12      13      31
Total resource consumption for correct results
CPU time (h)              190     240     260     240     190
Wall time (h)             140     190     210     190     150
Memory (GB)              7000    8900    8600    8500    7600
CPU Energy (kJ)          7700    9700   11000   10000    7900
Median resource consumption for correct results
CPU time (s)               61      95     100     100      97
Wall time (s)              32      54      69      70      54
Memory (MB)               600     920     930     910     840
CPU Energy (J)            590     920    1100    1100     920
Resource consumption of correct results per score point
CPU time (s/sp)            77      95      97      90      82
Wall time (s/sp)           55      72      78      72      64
Memory (MB/sp)            780     970     910     890     920
CPU Energy (J/sp)         850    1100    1100    1100     950


Figure 10 shows that CPAchecker is more resource efficient than the
sequential portfolio. The sequential combination with the best score performs
worst in resource efficiency.


4.4 RQ 2: Evaluation of Parallel-Portfolio Verifier 


We now present the results of the parallel-portfolio verifiers. The parallel portfolio
mostly achieves a worse score than the best-performing standalone tool, but the
parallel portfolio with 3 tools scores better. The parallel portfolio is affected by
two aspects: (1) the size of the parallel portfolio: if too many tools are used, then
none of them gets enough resources to verify the task; and (2) the selection of
tools: if there is a fast tool that produces a lot of wrong results, it reduces the
score. Parallel portfolios, in general, produce more wrong results, even more than
sequential portfolios, as the tools are running in parallel, whereas in a sequential
portfolio this can be somewhat mitigated by putting a more sound tool before a
less sound tool. Table 3 shows the summary of results for the parallel portfolios.
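Both aspects can be made concrete with a simplified model of a parallel portfolio. The model assumes that the CPU-time limit is shared evenly among the tools, that the result of the tool finishing first is taken without validation, and that all tools run until that point; these are illustrative assumptions, and the tool names and run times are hypothetical.

```python
def run_parallel(portfolio, outcomes, cpu_limit=900.0):
    """Simulate a parallel portfolio on one verification task.

    portfolio: list of tool names running in parallel.
    outcomes: tool -> (CPU seconds needed, whether the verdict is correct);
              tools without an entry produce no result.
    Returns (winning tool or None, correctness of its verdict or None, CPU time).
    """
    budget = cpu_limit / len(portfolio)  # assumption: equal share per tool
    finished = [
        (outcomes[tool][0], tool)
        for tool in portfolio
        if tool in outcomes and outcomes[tool][0] <= budget
    ]
    if not finished:
        return None, None, cpu_limit
    need, winner = min(finished)  # the fastest result wins
    # All tools consume CPU time until the winner finishes.
    return winner, outcomes[winner][1], need * len(portfolio)


if __name__ == "__main__":
    outcomes = {"tool_a": (200.0, True), "tool_b": (50.0, False)}
    print(run_parallel(["tool_a", "tool_b", "tool_c"], outcomes))
```

In this example, the fast but wrong verdict of `tool_b` is reported although `tool_a` would have produced a correct one, which mirrors aspect (2); enlarging the portfolio to eight tools would shrink the budget to 112.5 s and prevent `tool_a` from finishing at all, which mirrors aspect (1).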

Figure 11 shows that parallel portfolios have many more wrong results when
compared to CPAchecker. Interestingly, the graph for ParPortfolio-3, the best
performing parallel portfolio, remains lower than CPAchecker, i.e., it takes less
CPU time. This is because the parallel portfolio takes results of the most efficient
tool. Figure 12 shows that the best performing parallel portfolio performs better
than CPAchecker in terms of resource efficiency except memory consumption.


Fig. 9: Sequential portfolios: Score-based quantile plot comparing the best and
the worst sequential portfolio (SeqPortfolio-4 and SeqPortfolio-8, respectively)
with the best performing standalone tool (CPAchecker)

Fig. 10: Sequential portfolios: Parallel-coordinates plot showing unsolved tasks and
resource consumption per score point for best and worst portfolio (SeqPortfolio-4
and SeqPortfolio-8, resp.) and the best standalone tool (CPAchecker)


4.5 RQ 3: Evaluation of Algorithm Selection Verifier 


We now present the results of the algorithm-selection verifier. Table 4 shows the
summary of results for algorithm selection: There is a clear trend of better results
with more verifiers. This is expected because our selector that was trained using
machine learning has more options to choose from, and can choose the better
one. Also, an algorithm-selection verifier does not need to share resources for the
verification task. It needs to perform the prediction, which takes some resources;
but after this step all the provided resources are available to the verifier. The
number of wrong results is also comparable with CPAchecker, as the training
process is biased towards selecting the verifiers that are correct.

In Fig. 13, all the plots start from similar scores but at different times.
Initially, CPAchecker performs better with respect to CPU time, but after
around half of the score, algorithm selection starts being more efficient. Figure 14
shows that algorithm selection is also more resource efficient than CPAchecker.


Table 3: Parallel portfolios of different sizes with CPAchecker

                                        Parallel Portfolio of
Verifier           CPAchecker       2       3       4       8
Score                    9040    8969    9459    8952    7547
Correct results          5652    6101    6363    6001    5367
Correct proofs           3516    3780    3992    3639    3236
Correct alarms           2136    2321    2371    2362    2131
Wrong results               8      36      35      28      42
Wrong proofs                0      21      21      15      24
Wrong alarms                8      15      14      13      18
Total resource consumption for correct results
CPU time (h)              190     160     170     250     280
Wall time (h)             140      74      61      74      64
Memory (GB)              7000    8900   11000   14000   11000
CPU Energy (kJ)          7700    5400    5200    6500    6400
Median resource consumption for correct results
CPU time (s)               61      18      16      70     130
Wall time (s)              32     5.2     4.6      16      23
Memory (MB)               600     430     420    1000    1300
CPU Energy (J)            590     140     120     470     780
Resource consumption of correct results per score point
CPU time (s/sp)            77      65      66      99     130
Wall time (s/sp)           55      30      23      30      31
Memory (MB/sp)            780    1000    1200    1500    1400
CPU Energy (J/sp)         850     600     550     720     850


4.6 Discussion 


The experiments show that each of the compositions has a configuration that can
perform better than any standalone tool in terms of correctly solved tasks. Initially,
we thought that portfolios would be less resource efficient than standalone tools,
and, in particular, would not be able to solve hard tasks as the resources allocated
to each tool would be less. But the experimental data support the opposite: the
benchmark set had only a few such tasks; for most of the tasks that were hard for
one tool, there was some other tool that solved them in the given time. This was
especially pronounced in the parallel portfolio. The verifiers in the portfolios have
to be selected with different strengths; otherwise, there is no benefit, and the
portfolio might even perform worse.

Both the portfolios prefer fast results, as there is no selector. To mitigate this, 
one needs to either select the tools carefully or add a validation step. 

Our algorithm selection was based on a model trained using machine learning.
The training penalized the tools that produced more incorrect results, but it did
not consider the resource consumption of these tools. In comparison to both the
portfolios, the verifier based on algorithm selection produced far fewer incorrect
results. We think that if we had used the resource consumption data in our training,
the verifier based on selection would have consumed fewer resources. Our verifier
combinations are easy to construct by simply selecting tools that complement
each other well. Although this strategy is simple, we found that it still leads to
successful combinations for all evaluated combination types. Nevertheless, the
combinations can be further fine-tuned to achieve even better results.


Fig. 11: Parallel portfolios: Score-based quantile plot comparing the best and the
worst performing parallel portfolios (ParPortfolio-3 and ParPortfolio-8,
respectively) with the best performing standalone tool (CPAchecker)

Fig. 12: Parallel portfolios: Parallel-coordinates plot showing unsolved tasks and
resource consumption per score point of best and worst portfolio (ParPortfolio-3
and ParPortfolio-8, resp.) and the best standalone tool (CPAchecker)

The portfolio compositions are easy to construct, and with a well-diversified
tool selection, portfolios can perform well. Also, the portfolios should not be
too large unless we are willing to increase the resources. On the other hand,
training the selection requires more preliminary work, but with limited resources
and enough choice (number of tools) the selection-based verifier works better.


5 Threats to Validity 


External Validity. A combination of tools can only be as good as the parts it is
combined from. Therefore, the concrete instantiation of our tool combinations is
limited by the selected tools and their configuration. We have selected eight of the


Table 4: Algorithm-selection-based verifiers of different sizes with CPAchecker

                                        Algorithm Selection of
Verifier           CPAchecker       2       3       4       8
Score                    9040    9226    9689    9816    9886
Correct results          5652    5904    6086    6125    6214
Correct proofs           3516    3658    3843    3867    3896
Correct alarms           2136    2246    2243    2258    2318
Wrong results               8      15      11       8      11
Wrong proofs                0       6       4       3       3
Wrong alarms                8       9       7       5       8
Total resource consumption for correct results
CPU time (h)              190     200     200     200     210
Wall time (h)             140     160     160     150     170
Memory (GB)              7000    6900    6900    6200    6000
CPU Energy (kJ)          7700    8200    8600    8400    9000
Median resource consumption for correct results
CPU time (s)               61      47      48      66      55
Wall time (s)              32      30      30      35      42
Memory (MB)               600     740     700     550     420
CPU Energy (J)            590     490     500     660     620
Resource consumption of correct results per score point
CPU time (s/sp)            77      77      76      73      76
Wall time (s/sp)           55      61      61      56      63
Memory (MB/sp)            780     750     720     630     600
CPU Energy (J/sp)         850     890     890     850     910


most powerful verification tools as determined by the annual software-verification
competition, and executed them in the original configuration as submitted to
the competition. Furthermore, our evaluation results only hold for the given
benchmark set. While we have evaluated our tool combinations on programs taken
from one of the largest and most diverse verification benchmarks publicly available, the
performance of the evaluated combinations might differ on other sets of tasks.
Similarly, this also impacts the training of our algorithm selector. The training
of a learning-based algorithm selector, which we employ for tool combinations
based on algorithm selection, requires a large and diverse set of verification tasks,
and each task has to be labeled with the execution results of each tool in our
combination. The benchmark repository⁶ that we used was created by the efforts of
the verification community over many years. We are not aware of any other
benchmark set of verification tasks that is as diverse as this one. As a result, we
had to train our algorithm selector on the same dataset that we later used for
benchmarking the tool combinations. Therefore, we only showed that algorithm
selection improves the performance of verification on the given benchmark set

6 https://gitlab.com/sosy-lab/benchmarking/sv-benchmarks


64 Dirk Beyer, Sudeep Kanav, and Cedric Richter 


[Plot: x-axis cumulative score, y-axis min. time in s; series CPAchecker,
AlgoSelection-8, and AlgoSelection-2]

Fig. 13: Algorithm-selection-based verifiers: Score-based quantile plot comparing
the best and the worst performing algorithm selection (AlgoSelection-8 and
AlgoSelection-2, respectively) with the best performing standalone tool (CPAchecker)

Fig. 14: Algorithm-selection-based verifiers: Parallel-coordinates plot showing
unsolved tasks and resource consumption per score point of the best and the worst
performing algorithm selection (AlgoSelection-8 and AlgoSelection-2, respectively)
and the best performing standalone tool (CPAchecker)

and the selector might only generalize to sets of similarly distributed
verification tasks. For a fair comparison, we (1) restricted the training to linear
models, which are known to generalize well, (2) trained only on a random subset
of the benchmark, and (3) cross-validated our model over multiple benchmark
splits. The variance of selection performance between different splits was less
than 1%. Therefore, the performance of our trained algorithm selector is likely
independent of the random subset selected for training.


Finally, the evaluation of algorithm selection is dependent on the chosen
selection methodology, and choosing alternative selection methods, for example,
based on hand-crafted rules, might impact the evaluation. However, the design
of hand-crafted methods is not straightforward and might require deep expert
knowledge about the tool implementation. Depending on the human designer, this
design process might in addition be biased in favor of certain tool combinations,
which could also impact the experimental results.


For sequential portfolios, we ordered verifiers in sequence according to their
performance in SV-COMP 2021. Changing the order of the tools might change
the results with respect to resource consumption as well as soundness.

Internal Validity. We have used the same verifier archives, benchmark set,
benchmarking framework, resource limits, and infrastructure to execute our experiments
as was used in SV-COMP 2021. This minimizes the influence of a changing
environment on our experiments, allowing us to compare results of our verifier
combinations to the results of the standalone tools from SV-COMP 2021.

CoVeriTeam induces an overhead of about 0.8 s for each actor in the composition,
and around 44 MB memory overhead [15]. It is possible that one can reduce
this overhead by using shell scripts, but we decided in favor of using CoVeriTeam
for composing tools because of the modular design. This is especially pronounced
in our algorithm-selector composition. We could have saved a few seconds if we
had used a monolithic algorithm selector instead of composing one.


6 Related Work 


Combination Strategies for Software Verification. Combining verifiers to increase
the verification performance is well established in the domain of software
verification [1, 8, 20, 26, 31, 33, 46, 48, 49, 53]. In fact, the top three winning entries of the
software-verification competition SV-COMP 2021 all combine various verification
techniques to achieve their performance [6]. CPAchecker [8] combines up to six
different verification approaches into three sequential portfolios that are selected
depending on the task with an algorithm selector. PeSCo [49] ranks verification
algorithms according to their predicted likelihood of solving a given task and then
executes them sequentially in descending order. Ultimate Automizer [33] employs
an integrated tool chain of preprocessing and verification algorithms to solve a
given task. PredatorHP [46] and UFO [1] demonstrate that parallel portfolios
can also be a promising strategy when running multiple specialized algorithms at
the same time. Even though previous work showed that internal combinations
can be successfully applied to improve the effectiveness of a single tool, we show
that similar combinations can be effectively employed to combine ‘off-the-shelf’
verifiers. This gives us the unique opportunity to further increase the number of
verifiable programs by simply combining state-of-the-art verification tools.
Cooperative methods [20] distribute the workload of a single verification task
among multiple algorithms to combine their strengths. For example, conditional
model checking [11, 12, 13, 14] runs two or more verifiers in sequence, while the
program is reduced after every step to the state space of the program left unexplored by
the previous algorithm. CoVeriTest [10], a tool for test-case generation based on
verification, interleaves multiple verifiers, while (partially) sharing the analysis
state between algorithms. MetaVal [19] integrates verification tools for witness
validation (i.e., to check whether a previous verifier obtained a comprehensible
result) by instrumenting the produced witness into the verified program. While
cooperative methods are effective for reducing the workload of a verification task,
employing cooperative methods at tool level would require exchanging analysis
information between tools. In general, existing verification tools are not well
suited for this type of cooperation, which led us to explore black-box verifier
combinations. In addition, we showed that non-cooperative methods can improve
the verification effectiveness without the need to adapt the employed tools.


Combining Algorithms Beyond Software Verification. The idea of combining
algorithms to improve performance has been successfully applied in many research
areas including SAT solving [51, 54, 56], constraint-satisfaction programs [21, 45, 57]
and combinatorial-search problems [41]. Employed approaches traditionally
focused on portfolio-based approaches [21, 51, 54], but recent techniques started
to integrate algorithm selectors for either selecting single algorithms [45, 56] or
portfolios of algorithms [44, 57]. For example, earlier works in SAT solving [51, 54]
focused on parallel-portfolio solvers, while later works such as SATzilla [56]
further improve the solving process by selecting a task-dependent solver. However,
existing techniques often employ hybrid strategies between portfolios and
algorithm selection to achieve state-of-the-art performance. Therefore, Kashgarani
and Kotthoff [38] have recently shown that parallel portfolios are generally
bottlenecked by the available resources and that a pure algorithm selector that selects
a single algorithm performs better. While we observed that portfolios of software
verifiers are also restricted by available resources (i.e., the performance generally
stops improving after a certain portfolio size), we found that all evaluated
combination types yield a similar performance gain when configured correctly.


7 Conclusion 


This paper describes a method to construct combinations of verification tools in
a systematic and modular way. The method does not require any changes to the
verification tools that are used to construct the combinations. Our experimental
evaluation shows that all three considered combinations (sequential portfolio,
parallel portfolio, and algorithm selection) can lead to performance improvements.
The improvements can be significant although the construction does not require
significant development effort, because we use CoVeriTeam for the combination
and execution of verification tools. We hope that our contribution makes it easy
for practitioners to get access to the best performance out of the latest research
and development efforts in software verification.


Abstract. Software doping is a phenomenon that refers to the presence
of hidden software functionality, whose existence is only in the interest of
the manufacturer. The most prominent example is the diesel emissions
scandal. There is a need for methods that identify software doping, and
such methods are bound to be applied to the final product with little or no
knowledge about its internals. Black-box analysis techniques have recently
been developed for this purpose, harvesting the formal foundations of
software doping. This paper integrates them with established falsification
techniques for the purpose of real-world applicability. With a focus on
the diesel scandal and emissions tests on chassis dynamometers we make
the testing procedures significantly more effective in terms of time and
cost. The theoretical results are implemented in a prototypical doping
tester.


1 Introduction 


Embedded software is the innovation driver of our times. Software-defined systems 
are permeating our communication, perception, and storage technology as well 
as our personal interactions with technical systems at an unprecedented pace. 
“Software-defined everything” is among the hottest buzzwords in IT technology 
today [2,18]. 

There is a tremendous problem hiding behind this apparently unstoppable
trend: The owners of the physical “hull” of everything will not be the ones owning
the software defining everything, nor will they have the right to look at what
and how everything is defined. This is because commercial software typically
is protected by intellectual property rights of the software manufacturer. This
prohibits any attempt to disassemble the software or to reconstruct its inner
working, although it is the very software that is forecast to be defining everything.
The use of machine-learnt software components amplifies the problem considerably.
Since commercial interests of the software manufacturers are seldom aligned
with the interest of end users, the promise of software-defined everything might
well become a dystopia from the perspective of individual digital sovereignty.


A massive example of software-defined collective damage is the diesel emissions 
scandal. Over a period of more than 10 years, millions of diesel-powered cars have 
been equipped with illegal software that altogether polluted the environment for 
the sake of commercial advantages of the car manufacturers. At its core, this 
was made possible by the fact that only a single, precisely defined test setup 
was put in place for checking conformance with exhaust emissions regulations. 
This made it a trivial software engineering task to identify the test particularities 
and to turn off emission cleaning outside these particular conditions. This is an 
archetypical instance of software doping. 


Against this background, there is an urgent need to establish stronger and
enforceable requirements on the systems we are interacting with, and this is indeed
echoed in legislative frameworks [24]. However, the roll-out of such requirements
in everyday practice needs a firm understanding of the technological basis for
enforcing such requirements and for identifying violations thereof.


This paper is part of ongoing research addressing this challenge. It harvests 
the outcomes of three recent scientific achievements: (i) formal definitions of 
software doping based on contracts enforcing well-defined software behaviour in 
the vicinity of standardised behaviour [12], (ii) a solid foundation for doping tests 
to be carried out in practice [6], and (iii) probabilistic falsification techniques 
developed to guide the search for property violations in cyber-physical system 
engineering [1,20]. By combining the above ingredients, this paper addresses 
the question how to perform cost-effective doping tests that are indeed likely to 
succeed in uncovering actual cases of doped software. It approaches this question 
both from a foundational and from a practical perspective. On the foundational 
side, we introduce a temporal hyperlogic to reason about signals which we use to 
characterise the falsifiable fragment of a software doping contract. Great care 
is taken for this to work on the actual time-discrete traces that are recorded 
from the real system which itself is running in continuous time. On the practical 
side, we discuss a novel approach to probabilistic falsification that overcomes 
the problem that in many practical cases the possibility to carry out masses of 
highly-controlled experiments with a physical system is severely limited by cost or 
time budgets. To account for this, we add a passive recording component to the 
concept of falsification which observes the system in-the-wild to propose only few 
candidate traces to be inspected under lab conditions. All this is instantiated in 
the context of automotive emissions, where lab conditions correspond to expensive 
test runs on a chassis dynamometer, while observing the system in-the-wild is 
nothing else than collecting statistics while driving on normal roads. 


The paper makes the following distinguished contributions: (i) a linear tem- 
poral logic for hyperproperties over continuous signals that enables quantitative 
reasoning across traces, (ii) a logical reformulation of the falsifiable fragment 
of a software doping contract, (iii) a probabilistic falsification technique that 
uses passive recording for cost-effective doping testing, and (iv) an exemplary 
instantiation of these concepts in the context of automotive emissions. 

Related Work. Software doping theory provides a formal basis for enlarging the
requirements on vehicle exhaust emissions beyond too narrow lab test conditions. 
That conceptual limitation has by now been addressed by the official authorities 
responsible for car type approval [24,25]: The old NEDC-based test procedure 
is replaced by the newer Worldwide Harmonised Light Vehicles Test Procedure 
(WLTP), which is deemed to be more realistic. WLTP replaces the NEDC test by 
a new WLTC test, but WLTC still is just a single test scenario. In addition, WLTP 
embraces so called Real Driving Emissions (RDE) tests to be conducted on public 
roads. A recently launched mobile phone app [8], LolaDrives, harvests runtime 
monitoring technology for making low-cost RDE tests accessible to everyone. 

Learning or approximating the behaviour of a system under test has been 
studied intensively. Meinke and Sindhu [19] were among the first to present a 
testing approach incrementally learning a Kripke structure representing a reactive 
system. Volpato and Tretmans [27] propose a learning approach which gradually 
refines an under- and over-approximation of an input-output transition system 
representing the system under test. The correctness of this approach needs several 
assumptions, e.g., an oracle indicating when, for some trace, all outputs, which 
extend the trace to a valid system trace, have been observed. 


2 Background 


This section introduces the necessary background regarding temporal logics for 
hyperproperties and for continuous signals, probabilistic falsification basics, and 
reviews the formal definitions of software doping. 


2.1 Temporal Logics 


Linear Temporal Logic (LTL) [22] is a popular formalism to reason about properties
of traces. A trace is an infinite word where each literal is a subset of AP,
the set of atomic propositions. Programs are interpreted as sets $S_{\mathsf{LTL}} \subseteq (2^{\mathsf{AP}})^{\omega}$ of
such traces. LTL provides expressive means to characterise sets of traces, often
called trace properties.


Temporal Logics for Hyperproperties. For some set of traces T, a trace property 
defines a subset of T, whereas a hyperproperty defines a set of subsets of T. In 
this way it specifies which traces are valid in combination with one another. Many 
temporal logics have been extended to corresponding hyperlogics supporting 
the specification of hyperproperties. HyperLTL [11] is such a temporal logic for 
the specification of hyperproperties of reactive systems. It extends LTL with 
trace quantifiers and trace variables that make it possible to refer to multiple 
traces within a logical formula. A HyperLTL formula is defined by the following 
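grammar where $\pi$ is drawn from a set $\mathcal{V}$ of trace variables:

$$\psi ::= \exists \pi.\, \psi \mid \forall \pi.\, \psi \mid \phi$$
$$\phi ::= a_\pi \mid \neg\phi \mid \phi \wedge \phi \mid \mathsf{X}\,\phi \mid \phi \,\mathsf{U}\, \phi$$
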
The quantifiers $\exists$ and $\forall$ quantify existentially and universally, respectively, over
the set of traces. For example, the formula $\forall \pi.\, \exists \pi'.\, \phi$ means that for every trace $\pi$
there exists another trace $\pi'$ such that $\phi$ holds over the pair of traces. To account
for distinct valuations of atomic propositions across distinct traces, the atomic
propositions are indexed with trace variables: for some atomic proposition $a \in \mathsf{AP}$
and some trace variable $\pi \in \mathcal{V}$, $a_\pi$ states that $a$ holds in the initial position of
trace $\pi$. The temporal operators and Boolean connectives are interpreted as usual
for LTL. In particular, $\mathsf{X}\,\phi$ means that $\phi$ holds in the next state of every trace
under consideration. Likewise, $\phi \,\mathsf{U}\, \phi'$ means that $\phi'$ eventually holds in every
trace under consideration at the same point in time, provided $\phi$ holds in every
previous instant in all such traces. Further operators are derivable: $\mathsf{F}\,\phi \equiv \mathit{true} \,\mathsf{U}\, \phi$
requires $\phi$ to eventually hold in the future, $\mathsf{G}\,\phi \equiv \neg\mathsf{F}\,\neg\phi$ requires $\phi$ to always
hold, and the weak-until operator $\phi \,\mathsf{W}\, \phi' \equiv \phi \,\mathsf{U}\, \phi' \vee \mathsf{G}\,\phi$ allows $\phi$ to always hold
as an alternative to the obligation for $\phi'$ to eventually hold. We refer to [11] for
the formal semantics.


Temporal Logics over Continuous Domains. LTL enables reasoning over traces
$\sigma \in (2^{\mathsf{AP}})^{\omega}$ which are of discrete nature with respect to the time domain they
represent. With each literal in the trace representing a time step, $\sigma$ can equivalently
be viewed as a function $\mathbb{N} \to 2^{\mathsf{AP}}$. One extension of LTL is Signal Temporal
Logic (STL) [13,17], which instead is used for reasoning over real-valued signals
that may change in value along an underlying time domain. A signal is a function
$s : \mathcal{T} \to \mathbb{R}$ where $\mathcal{T}$ is the time domain. The time domain $\mathcal{T}$ can be either $\mathbb{N}$
(discrete-time signals), or $\mathbb{R}_{\geq 0}$ (continuous-time signals). This can be lifted to
multi-dimensional signals $w(t) = (s_1(t), \ldots, s_n(t))$, mapping each time point to
some element of $\mathbb{R}^n$. We refer to such a $w : \mathcal{T} \to \mathbb{R}^n$ as a (discrete-time or
continuous-time) trace of width $n$ in the sequel.

STL formulas can express properties of systems modelled as sets $S_{\mathsf{STL}} \subseteq (\mathcal{T} \to \mathbb{R}^n)$
of traces of some fixed width $n$, basically by making the atomic properties
refer to booleanizations of the signal values. The syntax of the variant of STL
that we use in this paper is as follows, where $f \in \mathbb{R}^n \to \mathbb{R}$:

$$\phi ::= \top \mid f > 0 \mid \neg\phi \mid \phi \wedge \phi \mid \phi \,\mathsf{U}\, \phi.$$


STL replaces atomic propositions by threshold predicates of the form $f > 0$,
which hold if and only if function $f$ applied to the signal values at the current
time returns a positive value. The Boolean operators and the Until operator $\mathsf{U}$
are very similar to those of HyperLTL. The Next operator $\mathsf{X}$ is not part of STL,
because "next" is without precise meaning in continuous time. The definitions
of the derived operators $\mathsf{F}$, $\mathsf{G}$ and $\mathsf{W}$ are the same as for HyperLTL. Formally,
the Boolean semantics of an STL formula $\phi$ at time point $t \in \mathcal{T}$ for a trace
$w = (s_1, \ldots, s_n)$ is defined inductively in Fig. 1.

$w, t \models \top$
$w, t \models f > 0$ iff $f(s_1(t), \ldots, s_n(t)) > 0$
$w, t \models \neg\phi$ iff $w, t \not\models \phi$
$w, t \models \phi \wedge \psi$ iff $w, t \models \phi$ and $w, t \models \psi$
$w, t \models \phi \,\mathsf{U}\, \psi$ iff there exists $t' \geq t$ s.t. $w, t' \models \psi$ and for all $t'' \in [t, t')$, $w, t'' \models \phi$

Fig. 1: Boolean semantics of STL formulas


Quantitative Interpretation. STL has been extended by a quantitative semantics
[1,13,14] as presented in Fig. 2. This semantics is designed in such a way
that whenever $\rho(\phi, w, t) \neq 0$, its sign indicates whether $w, t \models \phi$ holds in the
Boolean semantics. For any STL formula $\phi$, trace $w$ and time $t$, if $\rho(\phi, w, t) > 0$,
then $w, t \models \phi$ holds, and if $\rho(\phi, w, t) < 0$, then $w, t \models \phi$ does not hold. For the
scope of this paper, we work with the untimed Until operator, instead of allowing
$\mathsf{U}_{[a,b]}$ for arbitrary bounds $a, b \in \mathbb{R}$. With only the untimed Until operator, the
continuous-time and discrete-time semantics [14] coincide.

$\rho(\top, w, t) = \infty$
$\rho(f > 0, w, t) = f(s_1(t), \ldots, s_n(t))$
$\rho(\neg\phi, w, t) = -\rho(\phi, w, t)$
$\rho(\phi \wedge \psi, w, t) = \min(\rho(\phi, w, t), \rho(\psi, w, t))$
$\rho(\phi \,\mathsf{U}\, \psi, w, t) = \sup_{t' \geq t} \min\{\rho(\psi, w, t'), \inf_{t'' \in [t, t')} \rho(\phi, w, t'')\}$

Fig. 2: Quantitative semantics of STL formulas
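
As an illustration, the quantitative semantics of Fig. 2 can be sketched for a finite discrete-time trace. The sketch assumes that the supremum and infimum range only over the recorded samples (a finite-horizon truncation that is not part of the definition above); the tuple encoding of formulas and the names `rho` and `always` are hypothetical.

```python
import math

def rho(phi, w, t):
    """Robustness of formula phi on a finite trace w (list of sample tuples) at index t."""
    op = phi[0]
    if op == "true":
        return math.inf
    if op == "pred":                # ("pred", f) encodes the predicate f > 0
        return phi[1](*w[t])
    if op == "not":
        return -rho(phi[1], w, t)
    if op == "and":
        return min(rho(phi[1], w, t), rho(phi[2], w, t))
    if op == "until":               # ("until", phi1, phi2) encodes phi1 U phi2
        best = -math.inf
        prefix = math.inf           # infimum over the empty interval [t, t)
        for t2 in range(t, len(w)):
            best = max(best, min(rho(phi[2], w, t2), prefix))
            prefix = min(prefix, rho(phi[1], w, t2))
        return best
    raise ValueError(f"unknown operator: {op}")

def always(phi):
    """Derived operator G phi = not (true U not phi)."""
    return ("not", ("until", ("true",), ("not", phi)))

positive = ("pred", lambda x: x)    # the predicate x > 0 on a trace of width 1
trace = [(1.0,), (0.5,), (-2.0,)]
print(rho(always(positive), trace, 0))
```

For the derived operator G, the computed value is the minimum of the signal over the trace, so a negative result indicates that the predicate is violated at some sample.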

Robustness and Falsification. The value of the quantitative semantics can serve as
a robustness estimate and as such be used to search for a violation of the property
at hand, i.e., to falsify it. The robustness of STL formula $\phi$ is its quantitative
value at time 0, that is, $R_\phi(w) := \rho(\phi, w, 0)$. So, falsifying a formula $\phi$ for
a system $S_{\mathsf{STL}}$ boils down to a search problem with the goal condition
$R_\phi(w) < 0$. Successful falsification algorithms solve this problem by understanding
it as the optimisation problem $\mathrm{minimise}_{w \in S_{\mathsf{STL}}}\, R_\phi(w)$. Algorithm 1 [1,20]
sketches an algorithm for Monte-Carlo Markov Chain falsification, which is based on
acceptance-rejection sampling [10].

Algorithm 1 Monte-Carlo falsification
Input: w: Initial trace, R: Robustness function, PS: Proposal Scheme
Output: $w \in S_{\mathsf{STL}}$
1: while R(w) > 0 do
2:   w' ← PS(w)
3:   α ← exp(−β(R(w') − R(w)))
4:   r ← UniformRandomReal(0, 1)
5:   if r ≤ α then
6:     w ← w'
7:   end if
8: end while

Our version of the algorithm works on system traces instead of an input space. An
input to the algorithm is an initial trace $w$ and a computable robustness function
$R$. Robustness computation for finite timed traces of simulations of a system
has been discussed in the literature [13,14]; we omit this discussion here. The
third input PS is a proposal scheme that proposes a new trace to the algorithm
based on the previous one (line 2). The parameter $\beta$ (used in line 3) can be
adjusted during the search and is a means to avoid being trapped in local minima,
which would prevent finding a global minimum. Any two traces $w$ and $w' \in S_{\mathsf{STL}}$ with
robustness values $R(w)$ and $R(w')$ are sampled with probability proportional to
$\frac{e^{-\beta R(w)}}{e^{-\beta R(w')}}$ (lines 3-6). The algorithm seeks to minimise $R$ over the system's traces
$S_{\mathsf{STL}}$, and terminates when it finds a trace with a negative robustness value, i.e.,
a trace that violates the STL property from which $R$ is derived.
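
The loop of Algorithm 1 can be sketched as follows. This is a minimal illustration rather than a reference implementation: the function name `falsify`, the default value of `beta`, and the iteration bound `max_iter` are assumptions added here (Algorithm 1 itself has no iteration bound), and the acceptance ratio is capped at 1 to avoid numerical overflow.

```python
import math
import random

def falsify(w, robustness, propose, beta=1.0, max_iter=10_000, rng=random):
    """Search for a trace with non-positive robustness, following Algorithm 1.

    max_iter is an added safeguard so that the loop always terminates.
    The caller should check robustness(result) <= 0 to see whether the search succeeded.
    """
    r_w = robustness(w)
    for _ in range(max_iter):
        if r_w <= 0:                      # loop condition of line 1 no longer holds
            break
        w_new = propose(w)                # line 2: proposal scheme
        r_new = robustness(w_new)
        exponent = -beta * (r_new - r_w)  # line 3, with alpha capped at 1
        alpha = 1.0 if exponent >= 0 else math.exp(exponent)
        if rng.random() <= alpha:         # lines 4-6: accept or reject the proposal
            w, r_w = w_new, r_new
    return w

# Toy usage: the trace is a list of samples, robustness is its minimum,
# and the (deterministic) proposal lowers every sample by one.
result = falsify([3.0, 2.5], min, lambda w: [x - 1 for x in w])
print(result, min(result))
```

Proposals that lower the robustness are always accepted, whereas proposals that raise it are accepted with probability decreasing in `beta`, which is what allows the search to leave local minima.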


2.2 Software Doping 


Contracts and Robustness. Earlier work [12] has developed a formal basis for 
the purpose of characterising software doping, by providing precise definitions of 
when the system’s behaviour is clean, i.e., does not contain hidden functionalities 
not in the interest of the user. If a program exhibits behaviour that is not clean, 
it is doped. 

All cleanness definitions are based on the assumption that there is some 
well-defined and agreed standard input/output behaviour of the system. Robust 
cleanness, the cleanness definition that we work with in this paper, extends this 
behaviour to the vicinity around the inputs and outputs close to the standard 
behaviour. The definition of “vicinity” and of “standard behaviour” is assumed 
to be part of a contract between software manufacturer and user. The contract 
entails the standard behaviour, distance functions for input and output values, 
and distance thresholds to define the input and output vicinity, respectively. 
With this, a system behaviour is considered clean, if its output is (or stays) in 
the output vicinity of the standard, unless the input is (or moves) outside the 
standard’s input vicinity. 


Example 1. A concrete contract for diesel-powered cars will, for instance, enforce 
bounded deviations in exhaust emissions provided the driving profile stays in 
the bounded vicinity of the standardised tests (such as NEDC or WLTC). Recent 
experiments [6] have considered contracts based on NEDC with speed values as
inputs and NO$_x$ emissions as output values, together with distance functions
computing the absolute difference of speed inputs and NO$_x$ outputs, respectively;
the value thresholds were 15 km/h for inputs and 80 mg/km for outputs.


A function $d : X \times X \to \mathbb{R}_{\geq 0}$ is a pseudometric function if it satisfies $d(x, x) = 0$,
$d(x, y) = d(y, x)$ and $d(x, y) \leq d(x, z) + d(z, y)$ for all $x, y, z \in X$. We let $\sigma[k]$
denote the $k$-th literal of the infinite word $\sigma$.


Reactive Execution Model. We can view a (nondeterministic) reactive program
as a function $S_{\mathsf{r}} : \mathsf{In}^{\omega} \to 2^{(\mathsf{Out}^{\omega})}$ perpetually mapping inputs $\mathsf{In}$ to sets of outputs
$\mathsf{Out}$ [12]. A contract is a tuple $C = (\mathsf{StdIn}, d_{\mathsf{In}}, d_{\mathsf{Out}}, \kappa_i, \kappa_o)$ where $\mathsf{StdIn} \subseteq \mathsf{In}^{\omega}$
is the input space of the system designated to define the standard behaviour,
$d_{\mathsf{In}} : (\mathsf{In}^* \times \mathsf{In}^*) \to \mathbb{R}_{\geq 0}$ and $d_{\mathsf{Out}} : (\mathsf{Out}^* \times \mathsf{Out}^*) \to \mathbb{R}_{\geq 0}$ are pseudometric distance
functions on finite words over inputs, respectively outputs, $\kappa_i \in \mathbb{R}_{\geq 0}$ is a
constant defining the maximum distance to the standard input allowed, and
similarly $\kappa_o \in \mathbb{R}_{\geq 0}$ is the maximum distance between two outputs such that they
are still considered sufficiently close. For the purpose of this paper, we assume the
distance functions to be induced by pointwise pseudometric functions of the form
$d_{\mathsf{In}} : (\mathsf{In} \times \mathsf{In}) \to \mathbb{R}_{\geq 0}$ and $d_{\mathsf{Out}} : (\mathsf{Out} \times \mathsf{Out}) \to \mathbb{R}_{\geq 0}$ in a past-forgetful manner.

Definition 1. A reactive program $S_{\mathsf{r}} : \mathsf{In}^{\omega} \to 2^{(\mathsf{Out}^{\omega})}$ is robustly clean w.r.t.
contract $C = (\mathsf{StdIn}, d_{\mathsf{In}}, d_{\mathsf{Out}}, \kappa_i, \kappa_o)$ if for all input sequences $i, i' \in \mathsf{In}^{\omega}$ with
$i \in \mathsf{StdIn}$, it holds for arbitrary $k \geq 0$ that whenever $d_{\mathsf{In}}(i[j], i'[j]) \leq \kappa_i$ for all
$j \leq k$, then

1. for all $o \in S_{\mathsf{r}}(i)$ there exists $o' \in S_{\mathsf{r}}(i')$ such that $d_{\mathsf{Out}}(o[k], o'[k]) \leq \kappa_o$, and
2. for all $o' \in S_{\mathsf{r}}(i')$ there exists $o \in S_{\mathsf{r}}(i)$ such that $d_{\mathsf{Out}}(o[k], o'[k]) \leq \kappa_o$.

The definition enforces that whenever an input $i'$ remains within $\kappa_i$ vicinity
around the standard input $i$, then the output sets generated by $i$ and $i'$ are at
most $\kappa_o$ away from each other.


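HyperLTL Characterisation. D'Argenio et al. [12] prove that the following two
HyperLTL formulas characterise robust cleanness in the sense of Definition 1.

$$\forall \pi_1.\, \forall \pi_2.\, \exists \pi_2'.\; \mathsf{StdIn}_{\pi_1} \rightarrow \Big( \mathsf{G}(i_{\pi_2} = i_{\pi_2'}) \wedge \big( (d_{\mathsf{Out}}(o_{\pi_1}, o_{\pi_2'}) \leq \kappa_o) \,\mathsf{W}\, (d_{\mathsf{In}}(i_{\pi_1}, i_{\pi_2'}) > \kappa_i) \big) \Big) \tag{1}$$

$$\forall \pi_1.\, \forall \pi_2.\, \exists \pi_1'.\; \mathsf{StdIn}_{\pi_1} \rightarrow \Big( \mathsf{G}(i_{\pi_1} = i_{\pi_1'}) \wedge \big( (d_{\mathsf{Out}}(o_{\pi_1'}, o_{\pi_2}) \leq \kappa_o) \,\mathsf{W}\, (d_{\mathsf{In}}(i_{\pi_1'}, i_{\pi_2}) > \kappa_i) \big) \Big) \tag{2}$$

The non-atomic propositions in the formulas above are syntactic sugar; the input
and output values in system $S_{\mathsf{LTL}}$ give rise to a binary encoding into sets of
atomic propositions.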


Mixed-IO Model. The reactive execution model and the HyperLTL characterisation
above have the strict requirement that for every input, the system produces
exactly one output. Recent work [5,6] instead considers mixed-IO models, where
a program $S_{\mathsf{IO}} \subseteq (\mathsf{In} \cup \mathsf{Out})^{\omega}$ is a subset of traces containing both inputs and
outputs, but without any restriction on the order or frequency in which inputs
and outputs appear in the trace. In particular, they are not required to strictly
alternate (but they may, and in this way the reactive execution model can be
considered a special case). A particularity of this model is the distinct output
symbol $\delta$ for quiescence, i.e., the absence of an output. For example, finite
behaviour can be expressed by adding infinitely many $\delta$ symbols to a finite trace.
In this model, standard behaviour is captured by a subset $\mathsf{Std} \subseteq S_{\mathsf{IO}}$ of traces
of a system $S_{\mathsf{IO}}$. To capture the notion of robust cleanness in the mixed-IO
model, every trace is projected into an input, respectively output domain. The
set of input symbols contains one additional element $-_i$, which indicates that in
the respective steps an output was produced, but masks the concrete output.
Similarly, the set of output symbols contains the additional element $-_o$, to mask a
concrete input symbol. Projection on inputs $\downarrow_i : (\mathsf{In} \cup \mathsf{Out})^{\omega} \to (\mathsf{In} \cup \{-_i\})^{\omega}$ and
projection on outputs $\downarrow_o : (\mathsf{In} \cup \mathsf{Out})^{\omega} \to (\mathsf{Out} \cup \{-_o\})^{\omega}$ are defined for all traces
$\sigma \in (\mathsf{In} \cup \mathsf{Out})^{\omega}$ and $k \in \mathbb{N}$ as follows: $\sigma{\downarrow_i}[k] :=$ if $\sigma[k] \in \mathsf{In}$ then $\sigma[k]$ else $-_i$,
and similarly $\sigma{\downarrow_o}[k] :=$ if $\sigma[k] \in \mathsf{Out}$ then $\sigma[k]$ else $-_o$. The distance functions
$d_{\mathsf{In}}$ and $d_{\mathsf{Out}}$ apply on input and output symbols or their respective masks, i.e.,
they are pseudometrics in $(\mathsf{In} \cup \{-_i\}) \times (\mathsf{In} \cup \{-_i\}) \to \mathbb{R}_{\geq 0} \cup \{\infty\}$ and, respectively,
$(\mathsf{Out} \cup \{-_o\}) \times (\mathsf{Out} \cup \{-_o\}) \to \mathbb{R}_{\geq 0} \cup \{\infty\}$. As for the reactive model, we define
a contract formally as a tuple $C = (\mathsf{Std}, d_{\mathsf{In}}, d_{\mathsf{Out}}, \kappa_i, \kappa_o)$ (where StdIn is replaced
by Std, and the distance functions by their mixed-IO counterparts). Its satisfaction
is defined by the adapted robust cleanness definition below [6].


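Definition 2. A system $S_{\mathsf{IO}} \subseteq (\mathsf{In} \cup \mathsf{Out})^{\omega}$ is robustly clean w.r.t. contract
$C = (\mathsf{Std}, d_{\mathsf{In}}, d_{\mathsf{Out}}, \kappa_i, \kappa_o)$ if and only if $\mathsf{Std} \subseteq S_{\mathsf{IO}}$ and for all $\sigma \in \mathsf{Std}$, $\sigma' \in S_{\mathsf{IO}}$
and $k \geq 0$ it holds that whenever $d_{\mathsf{In}}(\sigma[j]{\downarrow_i}, \sigma'[j]{\downarrow_i}) \leq \kappa_i$ for all $j \leq k$ then

1. there exists $\sigma'' \in S_{\mathsf{IO}}$ such that $\sigma'{\downarrow_i} = \sigma''{\downarrow_i}$ and $d_{\mathsf{Out}}(\sigma[k]{\downarrow_o}, \sigma''[k]{\downarrow_o}) \leq \kappa_o$,
2. there exists $\sigma'' \in \mathsf{Std}$ such that $\sigma{\downarrow_i} = \sigma''{\downarrow_i}$ and $d_{\mathsf{Out}}(\sigma'[k]{\downarrow_o}, \sigma''[k]{\downarrow_o}) \leq \kappa_o$.

Def. 2 contains two requirements, numbered as 1. and 2. In the following, we
will sometimes explicitly address either of these conditions by referring to it as
the first, respectively second condition of robust cleanness.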


3 Logical characterisation for mixed IO 


This section discusses how to reformulate robust cleanness to make it amenable to 
probabilistic falsification. For this, we translate eq. (2) into a HyperSTL formula, 
subsequently remove its quantifiers by means of a highly efficient parallel
composition on the level of traces and, finally, carefully adapt this quantifier-free 
representation to the mixed-IO model. 


Hyperlogics over Continuous Domains. Previous work [21] extends STL to 
HyperSTL echoing the extension of LTL to HyperLTL. A major challenge of the 
robustness computation for HyperSTL formulas is the adequate handling of the 
continuous time domain when comparing two execution traces of a system. For 
systems that can be simulated, this can be avoided [21] by composing one or 
more copies of the simulation model in parallel to itself [11]. Snapshots of the 
composed system are effectively snapshots of the individual copies of the model at 
exactly the same time point. This approach is not available when interacting with 
(black-box) real-world cyber-physical systems (CPS). In such scenarios, a suitable
logic is HyperSTL* [7], an extension of STL* [9], which enables the comparison
of different time points in different traces by means of a freeze operator. We use 
a variant of this idea, but with a HyperSTL syntax similar to [21]. 


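$$\psi ::= \exists \pi.\, \psi \mid \forall \pi.\, \psi \mid \phi$$
$$\phi ::= \top \mid f > 0 \mid \neg\phi \mid \phi \wedge \phi \mid \phi \,\mathsf{U}\, \phi.$$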


The meaning of the universal and existential quantifier is as for HyperLTL. A 
crucial difference to the other logics presented above is the proposition f > 0. In 
contrast to HyperLTL and to the existing definition of HyperSTL, we consider 
it insufficient to allow propositions to refer to only a single trace. In HyperLTL 
that does not cause harm, because atomic propositions of individual traces can
be compared by means of the Boolean connectives. To formulate thresholds for
real values, however, we feel the need to allow real values from multiple traces to
be combined in the function $f$, and thus to appear as arguments of $f$. Hence, in
our semantics of HyperSTL, $f > 0$ holds if and only if the result of $f$, applied to
all traces quantified over, is greater than 0. For this to work formally, the arity
of function $f$ is the product of the trace width $n$ and the number $m$ of traces
quantified over at the occurrence of $f > 0$ in the formula, so $f : (\mathbb{R}^n)^m \to \mathbb{R}$.

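A trace assignment [11] $\Pi : \mathcal{V} \to S_{\mathsf{STL}}$ is a partial function assigning traces
of $S_{\mathsf{STL}}$ to variables. Let $\Pi[\pi := w]$ denote the same function as $\Pi$, except that
$\pi$ is mapped to trace $w$. The quantitative semantics of a HyperSTL formula $\psi$,
at time point $t \in \mathcal{T}$, for a system $S \subseteq (\mathcal{T} \to \mathbb{R}^n)$ and a trace assignment $\Pi$ is
defined inductively:

$\rho(\exists \pi.\, \psi, S, \Pi, t) = \max_{w \in S} \rho(\psi, S, \Pi[\pi := w], t)$
$\rho(\forall \pi.\, \psi, S, \Pi, t) = \min_{w \in S} \rho(\psi, S, \Pi[\pi := w], t)$
$\rho(\top, S, \Pi, t) = \infty$
$\rho(f > 0, S, \Pi, t) = f(\Pi(\pi_1)(t), \ldots, \Pi(\pi_m)(t))$ for $\mathrm{dom}(\Pi) = \{\pi_1, \ldots, \pi_m\}$ (footnote 1)
$\rho(\neg\phi, S, \Pi, t) = -\rho(\phi, S, \Pi, t)$
$\rho(\phi_1 \wedge \phi_2, S, \Pi, t) = \min(\rho(\phi_1, S, \Pi, t), \rho(\phi_2, S, \Pi, t))$
$\rho(\phi_1 \,\mathsf{U}\, \phi_2, S, \Pi, t) = \sup_{t' \geq t} \min\{\rho(\phi_2, S, \Pi, t'), \inf_{t'' \in [t, t')} \rho(\phi_1, S, \Pi, t'')\}$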

It is an easy exercise to show that for continuous-time signals this quantitative 
semantics of HyperSTL is a conservative extension of the quantitative semantics 
of STL discussed above. For discrete-time signals it is important to understand 
that discrete time points often represent points in continuous time. It is widely
accepted that this can be cast into a (strictly monotonic) timing function
$\tau : \mathbb{N} \to \mathbb{R}_{\geq 0}$ [3,14]. The HyperSTL semantics given above is meaningful in a
discrete-time setting if all traces share the same timing function. 


HyperSTL characterisation. As discussed in Section 2.2, robust cleanness is a 
hyperproperty. Recent work on testing and monitoring of robust cleanness [6] 
explains the difficulties of monitoring such hyperproperties. In essence, it turns 
out that the first condition of Definition 2 cannot be refuted by observing a real 
system. Intuitively, this is because this condition effectively puts a constraint on 
the lower bound of the size of the sets of outputs that a system must be able 
to produce whereas the second condition enforces an upper bound. A violation 
of the upper-bound constraint is irrevocable, i.e., once observed, the system 
is for sure not robustly clean. However, not having observed an output that is
larger than the lower bound does not exclude the possibility of observing such
an output in the future. We therefore follow [6], and focus only on the second
condition of the robust cleanness definition in our work on falsification. For
the HyperLTL characterisation this means that we only work with the second
formula, labelled (2).

1 We admit some sloppiness; the set $\mathrm{dom}(\Pi)$ should have a fixed order.

The HyperLTL characterisation (2) assumes the system to be a subset of
$(2^{\mathsf{AP}})^{\omega}$ and works with distances between traces by means of a Boolean encoding
into atomic propositions. We will describe how to transform the HyperLTL
formula (2) into a HyperSTL formula, where systems are given as subsets of
$(\mathcal{T} \to \mathbb{R}^n)$ for some width $n \in \mathbb{N}$. Robust cleanness distinguishes between inputs
and outputs, and we assume that the input set In and the output set Out are
represented as signals of width $m$, respectively width $l$. The system space then is
$S_{\mathsf{STL}} \subseteq (\mathcal{T} \to \mathbb{R}^{m+l})$. Solely for the sake of clarity, we will in the sequel, unless
otherwise stated, restrict to $m = l = 1$, i.e., $\mathsf{In} \subseteq \mathbb{R}$ and $\mathsf{Out} \subseteq \mathbb{R}$, and thus work
with a fixed width of 2, hence $S_{\mathsf{STL}} \subseteq (\mathcal{T} \to \mathsf{In} \times \mathsf{Out})$.

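We can assume a set $\mathsf{Std} \subseteq S_{\mathsf{STL}}$ as given, which defines all standard behaviours
of the system. The HyperSTL characterisation of the HyperLTL formula (2) is
then

$$\forall \pi_1.\, \forall \pi_2.\, \exists \pi_1'.\; \mathsf{Std}_{\pi_1} > 0 \rightarrow \Big( \big( \mathsf{G}(|i_{\pi_1} - i_{\pi_1'}| \leq 0) \wedge \mathsf{Std}_{\pi_1'} > 0 \big) \wedge \big( (d_{\mathsf{Out}}(o_{\pi_1'}, o_{\pi_2}) - \kappa_o \leq 0) \,\mathsf{W}\, (d_{\mathsf{In}}(i_{\pi_1'}, i_{\pi_2}) - \kappa_i > 0) \big) \Big) \tag{3}$$

The quantifiers remain unchanged relative to (2). The predicate $\mathsf{StdIn}_{\pi_1}$, which
holds if and only if $\pi_1$ is a standard input, is replaced by the function $\mathsf{Std}_{\pi_1}$,
which returns a positive value if $\pi_1$ is in Std, and a non-positive value otherwise.
The input equality requirement of $\pi_1$ and $\pi_1'$ is ensured by globally enforcing
$|i_{\pi_1} - i_{\pi_1'}| \leq 0$.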

Since we switched from the concept of standard inputs to the concept of
standard traces, we must also check that $\pi_1'$ is a standard trace. This echoes the
setup in Definition 2, where the second requirement asks for a trace $\sigma'' \in \mathsf{Std}$
instead of a trace from $S_{\mathsf{IO}}$, see [6] for an elaborate discussion. In the operands of
the Weak-Until operator $\mathsf{W}$, we replace the AP-encoded versions of $d_{\mathsf{In}}$ and $d_{\mathsf{Out}}$
by the original distance functions, and we perform simple arithmetic
operations to match the syntactic requirements of HyperSTL.

We remark that for encoding $\mathsf{Std}_{\pi}$, due to the absence of the Next-operator
in HyperSTL, it might be necessary to add a clock signal $s(t) = t$ to traces
in a preprocessing step, not considered here for the sake of avoiding cluttered
notation.


Quantifier Elimination. In many practical settings—where the different standard 
behaviours are spelled out upfront explicitly, as in NEDC and WLTC—it can be 
assumed that the number of distinct standard behaviours Std is finite (while 
there are infinitely many possible behaviours in Sst_). Finiteness of Std makes it 
possible to remove the quantifiers by enumeration, and opens the way to work 
with the STL fragment of HyperSTL, after proper adjustments. 

Let $\mathsf{Std} = \{w_1, \ldots, w_c\}$ be an arbitrary standard set with $c$ unique standard
traces. We will demonstrate the quantifier elimination by substituting by the
placeholder $V(\pi_1, \pi_2)$ the subformula (starting with $\exists \pi_1'. \ldots$) of formula (3) behind
the second quantification. We can switch the order of the $\forall$-quantifiers without
changing the semantics of the formula, so we are working with $\forall \pi_2.\, \forall \pi_1.\, V(\pi_1, \pi_2)$.
Then, by replacing the second quantifier with the infinite conjunction [23], we get

$$\forall \pi_2.\, \bigwedge_{w \in S_{\mathsf{STL}}} V(w, \pi_2).$$

The latter can be split into a finite and an infinite conjunction

$$\forall \pi_2.\, \bigwedge_{w \in \mathsf{Std}} V(w, \pi_2) \;\wedge \bigwedge_{w \in S_{\mathsf{STL}} \setminus \mathsf{Std}} V(w, \pi_2). \tag{4}$$

Let $W(\pi_1, \pi_2, \pi_1')$ be the placeholder, such that $V(\pi_1, \pi_2) = \exists \pi_1'.\, \mathsf{Std}_{\pi_1} > 0 \rightarrow
W(\pi_1, \pi_2, \pi_1')$. Unfolding $V$ in the right (infinite) conjunction in formula (4)
reveals

$$\bigwedge_{w \in S_{\mathsf{STL}} \setminus \mathsf{Std}} \exists \pi_1'.\, \mathsf{Std}_{w} > 0 \rightarrow W(w, \pi_2, \pi_1').$$

It follows directly from the definition of $\mathsf{Std}_w$ that for all $w \notin \mathsf{Std}$, $\mathsf{Std}_w$ is
non-positive. Hence that fragment of the formula is trivially fulfilled, and formula (4)
is equivalent to

$$\forall \pi_2.\, \bigwedge_{w \in \mathsf{Std}} V(w, \pi_2).$$

Combined with similar reasoning for the $\exists$-operator and disjunctions we can
altogether rewrite formula (3) into

$$\bigwedge_{w \in \mathsf{Std}} \; \bigvee_{w' \in \mathsf{Std}} \Big( \mathsf{G}(|i_w - i_{w'}| \leq 0) \wedge \big( (d_{\mathsf{Out}}(o_{w'}, o) - \kappa_o \leq 0) \,\mathsf{W}\, (d_{\mathsf{In}}(i_{w'}, i) - \kappa_i > 0) \big) \Big), \tag{5}$$

where the $\forall$-quantification over $\pi_1$ is replaced by the conjunction over standard
traces $w$, the $\exists$-quantification of $\pi_1'$ by the disjunction over standard traces $w'$,
and the remaining $\forall$-quantification of $\pi_2$ is eliminated by rewriting into a trace
formula and removing the trace indices from $i_{\pi_2}$ and $o_{\pi_2}$.


Self-composition in logic. Formula (5) is not yet an STL formula, because the
distance function $d_{\mathsf{In}}$ needs to compare the trace input with inputs of constant
traces from the set Std. A popular technique to analyse hyperproperties is
self-composition of a system [4,15]. We use a syntactic variant of parallel
self-composition as follows. For a trace width $n$, we compose the signals of the
trace under investigation $w = (s_1, \ldots, s_n)$ and the signals of each of the, say, $c$
standard traces $\{(s_{11}, \ldots, s_{1n}), \ldots, (s_{c1}, \ldots, s_{cn})\} = \mathsf{Std}$. The composed trace
then is of width $n + nc$, it is $w_{+} = (s_1, \ldots, s_n, s_{11}, \ldots, s_{1n}, \ldots, s_{c1}, \ldots, s_{cn})$. For
the restricted case considered here (one-dimensional input and output signals),
$w_{+} = (i, o, i_1, o_1, \ldots, i_c, o_c)$ is of trace width $2 + 2c$. The resulting STL formula for
monitoring robust cleanness is

$$\bigwedge_{1 \leq a \leq c} \; \bigvee_{1 \leq b \leq c} \Big( \mathsf{G}(|i_a - i_b| \leq 0) \wedge \big( (d_{\mathsf{Out}}(o_b, o) - \kappa_o \leq 0) \,\mathsf{W}\, (d_{\mathsf{In}}(i_b, i) - \kappa_i > 0) \big) \Big). \tag{6}$$

Recall that a discrete time interpretation of such a formula requires all system
traces to share the same timing function $\tau$.


Embedding into the mixed-IO model. The STL formula (6) still is bound to inputs 
and outputs forming pairs synchronized in time. A more realistic scenario is 
that of inputs and outputs occurring independently of each other. In particular, 
when testing a real-world CPS, the testing interface can either pass an input 
to the system under test or receive an output, but not both at the same time. 
Furthermore, certain tests require passing a series of inputs before receiving an
output at all [6]. The mixed-IO model supports such real-world testing scenarios.
Mixed-IO signals are always defined in the discrete time domain. A mixed-IO
signal $s \in (\mathsf{In} \cup \mathsf{Out})^{\omega}$ (or, equivalently, $s : \mathbb{N} \to \mathsf{In} \cup \mathsf{Out}$) is similar to a
real-valued discrete-time signal but the value domain $\mathbb{R}$ is replaced by the domain
$\mathsf{In} \cup \mathsf{Out}$. A discrete-time mixed-IO trace $w = (s_1, \ldots, s_n) \in ((\mathsf{In} \cup \mathsf{Out})^{\omega})^n$ is a
tuple of $n$ mixed-IO signals. Accordingly, predicates of the form $f > 0$ must use
functions $f$ that produce real values for mixed-IO signals. Formula (6) requires
that all traces share the same timing function. For continuous-time signals, we
ensure that this condition is met by transforming all traces into traces with a
common value frequency (say, 1 Hz) by averaging the values observed in a time
unit (of one second). Let $w = (s_1, \ldots, s_n) \in (\mathbb{N} \to \mathsf{In} \cup \mathsf{Out})^n$ be a recorded
trace with some timing function $\tau : \mathbb{N} \to \mathbb{R}_{\geq 0}$, that is sampled with at least
one value per time unit, i.e., $\tau(i + 1) - \tau(i) \leq 1$ for all $i \in \mathbb{N}$. This trace is
condensed to a new trace $w' = (s_1', \ldots, s_n')$ with timing function $\tau'(i) = i$, and
$s_j'(t) := \mathrm{average}\big(\bigcup_{i:\, \tau(i) \in [t, t+1)} s_j(i)\big)$ for $1 \leq j \leq n$, i.e., each signal $s_j'$ is piecewise
constant: for each unit time interval $[t, t+1)$ the signal value is set to the average
of all signal values originally recorded in that unit time interval.
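
The condensation step can be illustrated by a minimal sketch for a single numeric signal. The function name `condense` and the representation of a recorded signal as two parallel lists (sample values and their time stamps in seconds) are assumptions made for this illustration; as in the text, at least one sample per unit interval is assumed.

```python
import math
from statistics import mean

def condense(samples, times):
    """Average the samples whose time stamps fall into each unit interval [t, t+1)."""
    buckets = {}
    for value, tau in zip(samples, times):
        buckets.setdefault(math.floor(tau), []).append(value)
    horizon = max(buckets) + 1
    # Relies on the assumption that every unit interval contains at least one sample;
    # otherwise the lookup below raises a KeyError.
    return [mean(buckets[t]) for t in range(horizon)]

# Four samples recorded at irregular time stamps are condensed to one value per second.
print(condense([1.0, 3.0, 5.0, 7.0], [0.0, 0.5, 1.2, 2.0]))
```

The resulting list is indexed by whole seconds, so all condensed traces share the timing function $\tau'(i) = i$.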


For adjusting formula (6), let $\mathsf{Std} = \{s_1, \ldots, s_c\} \subseteq (\mathsf{In} \cup \mathsf{Out})^{\omega}$ be a set of
standard traces, each in the form of a single mixed-IO signal. Following the
syntactic self-composition idea from above, the composition of a trace $w$ under
investigation with Std is the trace $w_{+} = (w, s_1, \ldots, s_c) \in ((\mathsf{In} \cup \mathsf{Out})^{\omega})^{c+1}$. This
needs two subtle adjustments of the formula. First, the distances $d_{\mathsf{In}}$ and $d_{\mathsf{Out}}$
are replaced by their mixed-IO counterparts, and instead of directly
accessing inputs and outputs, the current value is projected to the input and
output domain, respectively. Second, since the sets In and Out are opaque, the
expression $|i_a - i_b|$ is not evaluable any more; it is replaced by the distance $d$
based on $d_{\mathsf{In}}$ and $d_{\mathsf{Out}}$. The resulting formula is

$$\bigwedge_{1 \leq a \leq c} \; \bigvee_{1 \leq b \leq c} \Big( \mathsf{G}(d(s_a{\downarrow_i}, s_b{\downarrow_i}) \leq 0) \wedge \big( (d_{\mathsf{Out}}(s_b{\downarrow_o}, w{\downarrow_o}) - \kappa_o \leq 0) \,\mathsf{W}\, (d_{\mathsf{In}}(s_b{\downarrow_i}, w{\downarrow_i}) - \kappa_i > 0) \big) \Big), \tag{7}$$

where $d$ is defined for some $\varepsilon > 0$ as

$$d(x, y) := \begin{cases} 0, & \text{if } x = y \\ d_{\mathsf{In}}(x, y) + \varepsilon, & \text{if } x \neq y \wedge x, y \in \mathsf{In} \\ d_{\mathsf{Out}}(x, y) + \varepsilon, & \text{if } x \neq y \wedge x, y \in \mathsf{Out} \\ \infty, & \text{otherwise.} \end{cases}$$

In the second and third clause of the above definition we add some positive value
$\varepsilon$ to the result of $d_{\mathsf{In}}$ and $d_{\mathsf{Out}}$, because they are pseudometrics, and $d_{\mathsf{In}}(i_1, i_2)$
could be 0 even if $i_1 \neq i_2$. For the correctness of formula (7), however, it is crucial
that $d(x, y) = 0$ if and only if $x = y$. For a good performance of the falsification
algorithm, we will nevertheless want to make use of $d_{\mathsf{In}}$ and $d_{\mathsf{Out}}$ if $i_1 \neq i_2$. We
remark that $d$ is not a metric, because the triangle inequality requirement now is
violated.
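
The case distinction defining $d$ can be sketched as follows. The factory name `make_d`, the membership predicates `is_input` and `is_output`, the default value of `eps`, and the encoding of symbols as tagged pairs such as `("i", 1.2)` are hypothetical choices made only for this illustration.

```python
import math

def make_d(d_in, d_out, is_input, is_output, eps=1e-6):
    """Build the distance d: zero exactly on equal symbols, shifted by eps otherwise."""
    def d(x, y):
        if x == y:
            return 0.0
        if is_input(x) and is_input(y):
            return d_in(x, y) + eps
        if is_output(x) and is_output(y):
            return d_out(x, y) + eps
        return math.inf
    return d

# A pseudometric that is not a metric: distinct inputs can have distance 0.
rounded = lambda x, y: abs(round(x[1]) - round(y[1]))
d = make_d(rounded, rounded,
           is_input=lambda s: s[0] == "i",
           is_output=lambda s: s[0] == "o")

print(d(("i", 1.2), ("i", 1.2)))   # equal symbols
print(d(("i", 1.2), ("i", 1.4)))   # distinct inputs that the pseudometric cannot separate
print(d(("i", 1.0), ("o", 1.0)))   # an input compared with an output
```

The second call shows why the shift by `eps` matters: the underlying pseudometric returns 0 for the two distinct inputs, whereas `d` returns a strictly positive value, so that `d(x, y) == 0` holds only for equal symbols.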

The discussion above has assembled all the details to formally back the 
following theorem, stating that a system satisfies formula (7) if and only if it 
satisfies the second condition of robust cleanness in Definition 2. 


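Theorem 1. Let $C = (\mathsf{Std}, d_{\mathsf{In}}, d_{\mathsf{Out}}, \kappa_i, \kappa_o)$ be a contract for some system $S_{\mathsf{IO}} \subseteq
(\mathsf{In} \cup \mathsf{Out})^{\omega}$ with $\mathsf{Std} = \{\sigma_1, \ldots, \sigma_c\} \subseteq S_{\mathsf{IO}}$, and let $\phi$ denote formula (7). Then,
for all $\sigma' \in S_{\mathsf{IO}}$, it holds $(\sigma', \sigma_1, \ldots, \sigma_c), 0 \models \phi$ if and only if for all $\sigma \in \mathsf{Std}$ and
$k \geq 0$ such that $d_{\mathsf{In}}(\sigma[j]{\downarrow_i}, \sigma'[j]{\downarrow_i}) \leq \kappa_i$ holds for all $j \leq k$, there exists $\sigma'' \in \mathsf{Std}$
such that $\sigma{\downarrow_i} = \sigma''{\downarrow_i}$ and $d_{\mathsf{Out}}(\sigma'[k]{\downarrow_o}, \sigma''[k]{\downarrow_o}) \leq \kappa_o$.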


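Example 2. We consider $\mathcal{C} = (\mathsf{Std}, d_{\mathsf{in}}, d_{\mathsf{out}}, \kappa_i, \kappa_o)$ where $\mathsf{Std} = \{w_1, w_2\}$ contains
the two standard traces $w_1 = 1_i\, 2_i\, 3_i\, 7_o\, 0_i\, \delta^{\omega}$ and $w_2 = 0_i\, 1_i\, 2_i\, 3_i\, 6_o\, \delta^{\omega}$. We here
decorate inputs with index $i$ and outputs with index $o$, i.e., $w_1$ describes a system
receiving the three inputs 1, 2, and 3, then producing the output 7, and finally
receiving input 0 before entering quiescence. We take

$$d_{\mathsf{out}}(o_1, o_2) = \begin{cases} |o_1 - o_2|, & \text{if } o_1, o_2 \in \mathsf{Out} \setminus \{\delta\} \\ 0, & \text{if } o_1 = o_2 = -_o \text{ or } o_1 = o_2 = \delta \\ \infty, & \text{otherwise,} \end{cases}$$

and similarly for $d_{\mathsf{in}}$. The contractual value thresholds are assumed to be $\kappa_i = 1$
and $\kappa_o = 6$.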

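Assume we are observing the trace $w = 0_i\, 1_i\, 2_i\, 6_o\, 0_i\, \delta^{\omega}$, to be monitored with
formula (7). First notice that for combinations of $a$ and $b$ in (7) where $a \ne b$,
the subformula $\mathbf{G}(d(s_a{\downarrow_i}, s_b{\downarrow_i}) \le 0)$ is always false, because $s_1$ and $s_2$ (i.e., the
combination of $w_1$ and $w_2$) have different values at time point 0. Hence, it remains
to show that

$$\big((d_{\mathsf{out}}(w_1{\downarrow_o}, w{\downarrow_o}) - \kappa_o \le 0) \;\mathbf{W}\; (d_{\mathsf{in}}(w_1{\downarrow_i}, w{\downarrow_i}) - \kappa_i > 0)\big) \;\wedge\; \big((d_{\mathsf{out}}(w_2{\downarrow_o}, w{\downarrow_o}) - \kappa_o \le 0) \;\mathbf{W}\; (d_{\mathsf{in}}(w_2{\downarrow_i}, w{\downarrow_i}) - \kappa_i > 0)\big).$$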


For the first part, the input distance between inputs in $w$ and $w_1$ is always 1
at positions 1 to 3; it is 0 at position 4 (because $-_i$ is compared to $-_i$) and at
position 5 and beyond. Thus, $d_{\mathsf{in}}(w_1{\downarrow_i}, w{\downarrow_i}) - \kappa_i$ is always at most 0, and the
right-hand side of the W operator is always false. Consequently, by definition of
W, the left operand of W must always hold, i.e., $d_{\mathsf{out}}(w_1{\downarrow_o}, w{\downarrow_o})$ must always
be less than or equal to 6. This is the case for $w_1$ and $w$: at all positions except for
4, $-_o$ is compared to $-_o$ (or $\delta$ to $\delta$), so the difference is 0, and at position 4, the
distance of 6 and 7 is 1.

For the second W-formula, $w$ is compared to $w_2$. These two traces are comparable
only to a limited extent: the order of input and output is altered at
the last two positions of the signals before quiescence. Hence, the right operand
of W is true at position 4, and the formula holds for the remaining trace. For
positions 1 to 3, the input distances are 0, because the input values are identical.
At these positions, the left operand must hold. The values are input values, so
$-_o$ is compared to $-_o$ at each position. This distance is defined to be 0, so it
holds that $0 - \kappa_o = -6 \le 0$, and the formula is satisfied. Since both formulas hold,
their conjunction holds, too, and trace $w$ is qualified as robustly clean. There
could, however, be other system traces not considered in this example that overall
violate robust cleanness of the system.
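
The two weak-until checks of Example 2 can be replayed mechanically. The following is a minimal sketch, not the monitoring procedure of the paper: the tagged-tuple trace encoding, the helper names (`trace`, `proj_i`, `proj_o`, `dist`, `weak_until`, `clean_wrt`), and the treatment of the infinite quiescent suffix as a single repeated final position are all assumptions made for illustration. It covers only the two conjuncts with a = b that remain after the argument above.

```python
import math

DELTA = ("o", "delta")  # quiescence, encoded as a special output


def trace(spec):
    """Build a trace from text such as '1i 2i 3i 7o 0i', then one delta."""
    return [(tok[-1], int(tok[:-1])) for tok in spec.split()] + [DELTA]


def proj_i(x):
    """Project to the input domain; non-inputs become the placeholder '-i'."""
    return x[1] if x[0] == "i" else "-i"


def proj_o(x):
    """Project to the output domain; non-outputs become the placeholder '-o'."""
    return x[1] if x[0] == "o" else "-o"


def dist(a, b, placeholder):
    """Mixed-IO distance with the three cases used in Example 2."""
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return abs(a - b)
    if a == b and a in (placeholder, "delta"):
        return 0
    return math.inf


def weak_until(left, right):
    """left W right on a finite prefix whose last position repeats forever."""
    for l, r in zip(left, right):
        if r:
            return True   # right operand releases the obligation
        if not l:
            return False  # left operand violated before any release
    return True


def clean_wrt(std, obs, kappa_i, kappa_o):
    """Evaluate (d_out - kappa_o <= 0) W (d_in - kappa_i > 0) for one standard trace."""
    left = [dist(proj_o(s), proj_o(x), "-o") - kappa_o <= 0 for s, x in zip(std, obs)]
    right = [dist(proj_i(s), proj_i(x), "-i") - kappa_i > 0 for s, x in zip(std, obs)]
    return weak_until(left, right)


w1 = trace("1i 2i 3i 7o 0i")
w2 = trace("0i 1i 2i 3i 6o")
w = trace("0i 1i 2i 6o 0i")
print(clean_wrt(w1, w, 1, 6), clean_wrt(w2, w, 1, 6))
```

Under these assumptions both checks succeed, matching the argument in the example; replacing the output 6 in the observed trace by a value more than 6 away from 7 makes the first check fail.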


Restriction of input space. Robust cleanness puts semantic requirements on 
fragments of a system’s input space, outside of which the system’s behaviour 
remains unspecified. Typically, the fragment of the input space covered is rather 
small. To falsify the STL formula (7), the falsifier has two challenging tasks. First,
it has to find a way to stay in the relevant input space, i.e., select inputs with a
distance of at most $\kappa_i$ from the standard behaviour. Only if this is assured can it
search for an output large enough to violate the $\kappa_o$ requirement. In this, a large
robustness estimate provided by the quantitative semantics of STL cannot serve
as an indicator for deciding whether an input is too far off or whether an output
stays too close to the standard behaviour.

The general strength of the falsification technique is its proven ability to 
discover outputs of a black-box system violating a property. That is why the 
technique is considered suitable for real-world robust cleanness tests. We can 
improve its efficiency significantly by narrowing upfront the input space the 
falsifier uses. 

In practice, test execution traces will always be finite. In previous real-life
doping tests, test execution lengths have been bounded by some constant
$B \in \mathbb{N}$ [6], i.e., systems are represented as sets of finite traces $S \subseteq (\mathsf{In} \cup \mathsf{Out})^{B}$
(each of which, for formal reasons, can be considered suffixed with $\delta^{\omega}$). In this
bounded horizon, we can provide a predicate discriminating between relevant
and irrelevant input sequences. Formally, the restriction to the relevant input
space fragment of a system $S \subseteq (\mathsf{In} \cup \mathsf{Out})^{B}$ is given by the set

$$\mathsf{In}_{\mathsf{Std},\kappa_i} = \Big\{ w \in S \;\Big|\; \exists w' \in \mathsf{Std}.\ \bigwedge_{k=0}^{B} \big(d_{\mathsf{in}}(w[k]{\downarrow_i}, w'[k]{\downarrow_i}) \le \kappa_i\big) \Big\}.$$

Since Std and B are finite,
membership is computable.
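
Because Std and the horizon are finite, the membership test amounts to a bounded loop. The sketch below assumes that traces have already been projected to the input domain and are given as equal-length lists in which `None` stands for a non-input position; the function names and this encoding are illustrative assumptions, not part of the paper.

```python
import math


def d_in(x, y):
    """Mixed-IO input distance on projected values (None = non-input position)."""
    if x is None and y is None:
        return 0
    if x is None or y is None:
        return math.inf
    return abs(x - y)


def in_relevant_input_space(w, std_traces, kappa_i):
    """True if some standard trace stays within kappa_i of w at every position."""
    return any(
        len(w) == len(s) and all(d_in(x, y) <= kappa_i for x, y in zip(w, s))
        for s in std_traces
    )


std = [[1, 2, 3, None, 0], [0, 1, 2, 3, None]]
print(in_relevant_input_space([0, 1, 2, None, 0], std, kappa_i=1))
print(in_relevant_input_space([5, 1, 2, None, 0], std, kappa_i=1))
```

A falsifier could use such a predicate to reject candidate inputs before any emission estimate is computed.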

There are rare cases in which this optimisation may prevent the falsifier from
finding a counterexample. This is only the case if there is an input prefix leading
to a violation of the formula for which there is no suffix such that the whole trace
satisfies the $\kappa_i$ constraint. Below is a pathological example in which this could
make a difference.


Example 3. Apart from NO$_x$ emissions, NEDC (and WLTC) tests are used to
measure fuel consumption. Consider a contract similar to the contracts above, but
with fuel rate as the output quantity. Assuming a “normal” fuel rate behaviour
during the standard test, there might be a test within a reasonable $\kappa_i$ distance
in which fuel is wasted at an extreme rate. Then, the fuel tank might run empty before
the intended end of the test, which therefore could not be finished within the $\kappa_i$
distance, because speed would be constantly 0 at the end. The actually driven
test is not in the set $\mathsf{In}_{\mathsf{Std},\kappa_i}$, but there is a prefix within $\kappa_i$ distance that violates the
robust cleanness property.


4 Diesel Emissions 


This section discusses how to tailor the generic probabilistic falsification approach 
for STL based on Algorithm 1 to the particular case of diesel emissions, and 
reports on empirical observations when putting the approach into practice. 


Robustness. In the case of diesel emissions doping, the only standard behaviour
is either the NEDC or the WLTC. Assuming, for example, NEDC, let
$\mathcal{C} = (\{\mathrm{NEDC} \cdot o\}, d_{\mathsf{in}}, d_{\mathsf{out}}, \kappa_i, \kappa_o)$ be a diesel-emissions-specific contract, where
NEDC is the sequence of 1180 inputs with the $k$th input defining the speed of the
car after $k$ seconds from the beginning of the test. Here, the output $o$ suffixed
to NEDC is the (average) amount of NO$_x$ emitted during the NEDC drive. By
restricting the input space to $\mathsf{In}_{\{\mathrm{NEDC} \cdot o\},\kappa_i}$ as explained in Section 3, formula (7)
can be simplified to

$$\mathbf{G}\big(d_{\mathsf{out}}((\mathrm{NEDC} \cdot o){\downarrow_o}, s{\downarrow_o}) - \kappa_o \le 0\big). \tag{8}$$

This is because the conjunction and disjunction over standard traces become
obsolete when there is only a single standard trace. For the same reason, the requirement
$\mathbf{G}(d(s_a{\downarrow_i}, s_b{\downarrow_i}) \le 0)$ becomes obsolete, as the compared traces are always identical.
In the W subformula, the right proposition is always false because of the restricted
input space: the proposition collapses to $d_{\mathsf{in}}((\mathrm{NEDC} \cdot o){\downarrow_i}, s{\downarrow_i}) - \kappa_i > 0$, and the input
domain $\mathsf{In}_{\{\mathrm{NEDC} \cdot o\},\kappa_i}$ is $\{ s \in S \mid \forall k \in [0, 1180].\ d_{\mathsf{in}}(s[k]{\downarrow_i}, (\mathrm{NEDC} \cdot o)[k]{\downarrow_i}) \le \kappa_i \}$.
Thus, by the definition of W and U, the W subformula is equivalent to formula (8).
We implemented Algorithm 1 for the robustness computation according
to formula (8).


Emissions Approximation. In practice, running tests like NEDC with real cars
is a time-consuming and expensive endeavour. Furthermore, rental companies
usually prohibit tests on chassis dynamometers with rented cars.
On the other hand, car emission models for simulation are
not available to the public, and models provided by the manufacturer cannot
be considered trustworthy. To carry out our experiments, we instead use an
approximation technique that estimates the amount of NO$_x$ emissions of a car
along a certain trajectory based on data recorded during previous trips with the
same car, sampled at a frequency of 1 Hz (one sample per second). Notably, these
trips do not need to have much in common with the trajectory to be approximated.
A trip is represented as a finite sequence $\tau \in (\mathbb{R} \times \mathbb{R} \times \mathbb{R})^{*}$ of triples, where
each such triple $(v, a, n)$ represents the speed, the acceleration and the absolute
amount of NO$_x$ emitted at a particular time instant in the sample. Speed and
acceleration can be considered the main parameters influencing the instantaneous
emission of NO$_x$. This is, for instance, reflected in the RDE regulation [16, 24],
where the decisive quantities to validate the test route and driving behaviour
during RDE tests are speed and acceleration.

A recording $D$ is the union of finitely many trips $\tau$. We can turn such
a recording into a predictor $P$ of the NO$_x$ values given pairs of speed and
acceleration as follows:

$$P(v, a) = \mathrm{average}\,\big[\, n \;\big|\; \exists v', a'.\ (|v - v'| \le 2 \wedge |a - a'| \le 2 \wedge (v', a', n) \in D) \,\big].$$

The amount of NO$_x$ assigned to a pair $(v, a)$ here is the average of all NO$_x$
values seen in the recording $D$ for $v \pm \xi$ and $a \pm \xi$, with $0 \le \xi \le 2$. To overcome
measurement inaccuracies and to increase the robustness of the approximated
emissions, the speed and acceleration may deviate by up to 2 km/h and 2 m/s²,
respectively. This tolerance is adopted from the official NEDC regulation [26],
which allows deviations of up to 2 km/h while driving the NEDC.
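
The predictor P can be sketched directly from its definition. The function name `make_predictor`, the list-of-triples encoding of a recording, and the choice to return `None` when no recorded sample lies within the tolerance window are assumptions for illustration; the text does not say how pairs without matching samples are handled.

```python
def make_predictor(recording, tol_v=2.0, tol_a=2.0):
    """Build P(v, a) from a recording given as (speed, acceleration, nox) triples."""

    def predict(v, a):
        # Collect the NOx values of all samples within the tolerance window.
        matches = [
            n
            for (v2, a2, n) in recording
            if abs(v - v2) <= tol_v and abs(a - a2) <= tol_a
        ]
        if not matches:
            return None  # assumption: no estimate without nearby samples
        return sum(matches) / len(matches)

    return predict


# Hypothetical recording, values chosen only to exercise the window.
D = [(50.0, 0.0, 10.0), (51.0, 1.0, 20.0), (80.0, 0.0, 100.0)]
P = make_predictor(D)
print(P(50.0, 0.0), P(80.0, 0.0), P(120.0, 0.0))
```

For the query (50, 0), the first two samples fall inside the window, so the sketch averages their NO$_x$ values.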


Experiment setup. To demonstrate the practical applicability of our implementation
of Algorithm 1 and our NO$_x$ approximation, we report here on two
experiments. For the first experiment, we use recordings from Biewer et al. [8].
They used the app LolaDrives to perform low-cost RDE tests and recorded the
data received from a car’s diagnosis port. Using the two RDE recordings that
appear in their work, the above predictor estimates the NO$_x$
emissions during NEDC to be 86 mg/km. Their car was an Audi A6 Avant Diesel,
admitted in June 2020. We rented the successor of this car model, admitted in
2021, and recorded three low-cost RDE trips with the help of LolaDrives. The
new version of this car turned out to have a significantly better emission cleaning
system: the estimated amount of NO$_x$ emitted during the NEDC is 9 mg/km. In
the sequel, we will refer to the first car as A20 and to the second as A21. Car
A20 has previously been falsified w.r.t. the RDE specification. Neither A20 nor
A21 has been falsified w.r.t. robust cleanness.


Contracts. Before turning to falsification, we need to spell out meaningful
contracts. The input domain $\mathsf{In} \subseteq \mathbb{R}^{*}$ is the set of finite speed trajectories, and
the set $\mathsf{Out} \subseteq \mathbb{R}$ represents the average amount of NO$_x$ emitted during the test.
$d_{\mathsf{In}}$ must be past-forgetful, hence only the last speed value in each trace must
be considered. A natural distance function for inputs is $d_{\mathsf{In}}(v_1, v_2) = |v_1 - v_2|$.
Similarly, a measure for the distance of outputs is $d_{\mathsf{Out}}(o_1, o_2) = |o_1 - o_2|$.
Adding the necessary technicalities for the mixed-IO setting, we get $d_{\mathsf{in}}$ and $d_{\mathsf{out}}$
as defined in Example 2. For $\kappa_i$, it turned out that $\kappa_i = 15$ km/h is a reasonable
choice, as it leaves enough flexibility for human-caused driving mistakes and
intended deviations [6]. The threshold for NO$_x$ emissions under lab conditions is
80 mg/km. The emission limits for RDE tests depend on the admission date of
the car. Cars admitted in 2020 or earlier must emit 168 mg/km at most, and cars
admitted later must adhere to the limit of 120 mg/km. For our experiments, we use
$\kappa_o = 88$ mg/km for A20 and $\kappa_o = 40$ mg/km for A21 to have the same tolerances
as for RDE tests. Effectively, the upper threshold for A20 is 86 + 88 = 174 mg/km,
and for A21 the limit is 9 + 40 = 49 mg/km. Notice that for software doping
analysis, the output observed for a certain standard behaviour and the constant
$\kappa_o$ define the effective threshold; this threshold is typically different from the
threshold defined by the regulation.


Fig. 3: NEDC speed profile (blue, dashed) and input falsifying $\mathcal{C}$ for $\kappa_o = 88$ mg/km
(red) with 182 mg/km of emitted NO$_x$.


Evaluation. We modified Algorithm 1 by adding a timeout condition, i.e., if the 
algorithm is not able to find a falsifying counterexample within 3,000 iterations, 
it terminates and returns both the trace for which the smallest robustness has 
been observed and its corresponding robustness value. Hence, if falsification of
robust cleanness for a system is not possible, the algorithm outputs an upper
bound on how robustly the system satisfies robust cleanness.
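
The effect of the iteration cap can be illustrated with a simplified loop. This is only a sketch under stated assumptions: it uses plain random sampling as a stand-in and does not reproduce the probabilistic search of Algorithm 1; the names `falsify`, `sample_trace`, and `robustness` are hypothetical.

```python
import random


def falsify(sample_trace, robustness, max_iterations=3000, seed=0):
    """Search for a trace with negative robustness, giving up after a fixed budget.

    Returns (trace, robustness_value, falsified). If no counterexample is found,
    the returned value is the smallest robustness observed, i.e. an upper bound
    on how robustly the property is satisfied.
    """
    rng = random.Random(seed)
    best_trace, best_rob = None, float("inf")
    for _ in range(max_iterations):
        candidate = sample_trace(rng)
        rob = robustness(candidate)
        if rob < best_rob:
            best_trace, best_rob = candidate, rob
        if best_rob < 0:
            return best_trace, best_rob, True  # counterexample found
    return best_trace, best_rob, False  # budget exhausted


# Toy usage: a "trace" is a single number and its robustness is the number itself.
print(falsify(lambda rng: rng.uniform(1.0, 2.0), lambda x: x))
```

In the toy usage no candidate has negative robustness, so the loop exhausts its budget and reports the smallest value it has seen.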

Fig. 4: NEDC speed profile (blue, dashed) and input maximising NO$_x$ emissions
to 11 mg/km (red).


For the concrete case of the diesel emissions, the robustness value during the
first 1180 inputs (sampled from the restricted input space $\mathsf{In}_{\mathsf{Std},\kappa_i}$) is always $\kappa_o$.
When the NEDC output $o_{\mathrm{NEDC}}$ and the non-standard output $o$ are compared, the
robustness value is $\kappa_o - |o_{\mathrm{NEDC}} - o|$ (cf. eq. (8), the quantitative semantics of
STL, and the definition of $d_{\mathsf{out}}$). Hence, for test cycles with small robustness values,
we get NO$_x$ emissions $o$ that are either very small or very large compared to
$o_{\mathrm{NEDC}}$. We ran the modified Algorithm 1 on A20 and A21 for the contracts defined
above. For A20, it found a robustness value of −8, i.e., it was able to falsify robust
cleanness relative to the assumed contract and found a test cycle for which NO$_x$
emissions of 182 mg/km are predicted. The test cycle is shown in Fig. 3. For A21,
the smallest robustness estimate found, even after 100 independent executions
of the algorithm, was 38, i.e., A21 is predicted to satisfy robust cleanness with
a very high robustness upper bound. The corresponding test cycle is shown in
Fig. 4.
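
The reported robustness values follow from the expression for the final comparison of outputs. As a quick check, the sketch below plugs in the estimated NEDC emissions (86 mg/km for A20, 9 mg/km for A21), the thresholds (88 and 40 mg/km), and the predicted emissions of the found test cycles; the function name is an illustrative assumption.

```python
def output_robustness(kappa_o, o_nedc, o):
    """Robustness of the final output comparison: kappa_o - |o_NEDC - o|."""
    return kappa_o - abs(o_nedc - o)


print(output_robustness(88, 86, 182))  # A20
print(output_robustness(40, 9, 11))    # A21
```

A negative value means that the predicted output deviates from the NEDC output by more than the contract allows.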


5 Conclusion & Future Work 


This paper marks an important milestone in making software doping tests of 
real-world CPS practically feasible. Regarding test execution effort, real-world 
testing of CPS is not scalable; the number of tests realistically executable is 
usually very limited. Probabilistic falsification has its strength in repetitive testing 
of a system model in a strategic way. We improved this approach by embedding 
it into a very natural problem solving strategy: Patiently observing the system 
in-the-wild for the purpose of eventually conducting a small set of doping tests in 
an even more strategic way. With this paper, we have laid the formal foundations, 
and we have carved out the aspects that dominate practical applicability. For 
the latter we focussed on the automotive emissions context. In that context, we 
are currently spending considerable effort on the acquisition of more high-quality 
training data. We are building a car data platform (CDP) as a central place 
for automotive data, which, most importantly, includes the app LolaDrives for 
convenient recording, uploading and crowd-sourcing of data. With increasing 
amounts of data collected we hope to be able to roll out predictions that are more 
and more precise. Finally, we will extend the approach to broader application 
contexts, to make software doping tests available across the wider CPS domain. 




References 


1. Abbas, H., Fainekos, G.E., Sankaranarayanan, S., Ivancic, F., Gupta, A.: Probabilistic temporal logic falsification of cyber-physical systems. ACM Trans. Embed. Comput. Syst. 12(2s), 95:1-95:30 (2013). https://doi.org/10.1145/2465787.2465797

2. Adroit, A.: Software-defined everything (SDE) market perspective (2021-2027): Cisco Systems Inc, Dell Inc, EMC Corp, Extreme Networks, Fujitsu Ltd, Hewlett Packard Enterprise. New Mexico Tribune (2021), https://nmtribune.com/uncategorized/199383/software-defined-everything-sde-market-perspective-2021-2027-cisco-systems-inc-dell-inc-emc-corp-extreme-networks-fujitsu-ltd-hewlett-packard-enterprise/, Online; accessed: 2021-07-13

3. Alur, R., Henzinger, T.A.: Real-time logics: Complexity and expressiveness. In: Proceedings of the Fifth Annual Symposium on Logic in Computer Science (LICS ’90), Philadelphia, Pennsylvania, USA, June 4-7, 1990. pp. 390-401. IEEE Computer Society (1990). https://doi.org/10.1109/LICS.1990.113764

4. Barthe, G., D’Argenio, P.R., Rezk, T.: Secure information flow by self-composition. Mathematical Structures in Computer Science 21(6), 1207-1252 (2011). https://doi.org/10.1017/S0960129511000193

5. Biewer, S., D’Argenio, P., Hermanns, H.: Doping tests for cyber-physical systems. In: Parker, D., Wolf, V. (eds.) Quantitative Evaluation of Systems, 16th International Conference, QEST 2019, Glasgow, UK, September 10-12, 2019, Proceedings. Lecture Notes in Computer Science, vol. 11785, pp. 313-331. Springer (2019). https://doi.org/10.1007/978-3-030-30281-8_18

6. Biewer, S., D’Argenio, P.R., Hermanns, H.: Doping tests for cyber-physical systems. ACM Trans. Model. Comput. Simul. 31(3), 16:1-16:27 (2021). https://doi.org/10.1145/3449354

7. Biewer, S., Dimitrova, R., Fries, M., Gazda, M., Heinze, T., Hermanns, H., Mousavi, M.R.: Conformance Relations and Hyperproperties for Doping Detection in Time and Space. Logical Methods in Computer Science 18(1), 14:1-14:39 (2022). https://doi.org/10.46298/lmcs-18(1:14)2022

8. Biewer, S., Finkbeiner, B., Hermanns, H., Kohl, M.A., Schnitzer, Y., Schwenger, M.: RTLola on board: Testing real driving emissions on your phone. In: Groote, J.F., Larsen, K.G. (eds.) Tools and Algorithms for the Construction and Analysis of Systems - 27th International Conference, TACAS 2021, Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2021, Luxembourg City, Luxembourg, March 27 - April 1, 2021, Proceedings, Part II. Lecture Notes in Computer Science, vol. 12652, pp. 365-372. Springer (2021). https://doi.org/10.1007/978-3-030-72013-1_20

9. Brim, L., Dluhos, P., Safranek, D., Vejpustek, T.: STL*: Extending signal temporal logic with signal-value freezing operator. Inf. Comput. 236, 52-67 (2014). https://doi.org/10.1016/j.ic.2014.01.012

10. Chib, S., Greenberg, E.: Understanding the Metropolis-Hastings algorithm. The American Statistician 49(4), 327-335 (1995). https://doi.org/10.1080/00031305.1995.10476177

11. Clarkson, M.R., Finkbeiner, B., Koleini, M., Micinski, K.K., Rabe, M.N., Sánchez, C.: Temporal logics for hyperproperties. In: Abadi, M., Kremer, S. (eds.) POST 2014. LNCS, vol. 8414, pp. 265-284. Springer (2014). https://doi.org/10.1007/978-3-642-54792-8_15

12. D’Argenio, P.R., Barthe, G., Biewer, S., Finkbeiner, B., Hermanns, H.: Is your software on dope? - Formal analysis of surreptitiously “enhanced” programs. In: Programming Languages and Systems - 26th European Symposium on Programming, ESOP 2017, Proceedings. LNCS, vol. 10201, pp. 83-110. Springer (2017). https://doi.org/10.1007/978-3-662-54434-1_4

13. Donzé, A., Ferrére, T., Maler, O.: Efficient robust monitoring for STL. In: Sharygina, N., Veith, H. (eds.) Computer Aided Verification - 25th International Conference, CAV 2013, Saint Petersburg, Russia, July 13-19, 2013. Proceedings. Lecture Notes in Computer Science, vol. 8044, pp. 264-279. Springer (2013). https://doi.org/10.1007/978-3-642-39799-8_19

14. Fainekos, G.E., Pappas, G.J.: Robustness of temporal logic specifications for continuous-time signals. Theor. Comput. Sci. 410(42), 4262-4291 (2009). https://doi.org/10.1016/j.tcs.2009.06.021

15. Finkbeiner, B., Rabe, M.N., Sánchez, C.: Algorithms for model checking HyperLTL and HyperCTL*. In: Kroening, D., Pasareanu, C.S. (eds.) CAV 2015. LNCS, vol. 9206, pp. 30-48. Springer (2015). https://doi.org/10.1007/978-3-319-21690-4_3

16. Kohl, M.A., Hermanns, H., Biewer, S.: Efficient monitoring of real driving emissions. In: Colombo, C., Leucker, M. (eds.) Runtime Verification - 18th International Conference, RV 2018, Limassol, Cyprus, November 10-13, 2018, Proceedings. Lecture Notes in Computer Science, vol. 11237, pp. 299-315. Springer (2018). https://doi.org/10.1007/978-3-030-03769-7_17

17. Maler, O., Nickovic, D.: Monitoring temporal properties of continuous signals. In: Lakhnech, Y., Yovine, S. (eds.) Formal Techniques, Modelling and Analysis of Timed and Fault-Tolerant Systems, Joint International Conferences on Formal Modelling and Analysis of Timed Systems, FORMATS 2004 and Formal Techniques in Real-Time and Fault-Tolerant Systems, FTRTFT 2004, Grenoble, France, September 22-24, 2004, Proceedings. Lecture Notes in Computer Science, vol. 3253, pp. 152-166. Springer (2004). https://doi.org/10.1007/978-3-540-30206-3_12

18. Mathews, M.: Are You Ready for Software-Defined Everything? Wired, https://www.wired.com/insights/2013/05/are-you-ready-for-software-defined-everything/, Online; accessed: 2021-07-13

19. Meinke, K., Sindhu, M.A.: Incremental learning-based testing for reactive systems. In: Gogolla, M., Wolff, B. (eds.) Tests and Proofs - 5th International Conference, TAP@TOOLS 2011, Zurich, Switzerland, June 30 - July 1, 2011. Proceedings. Lecture Notes in Computer Science, vol. 6706, pp. 134-151. Springer (2011). https://doi.org/10.1007/978-3-642-21768-5_11

20. Nghiem, T., Sankaranarayanan, S., Fainekos, G.E., Ivancic, F., Gupta, A., Pappas, G.J.: Monte-carlo techniques for falsification of temporal properties of non-linear hybrid systems. In: Johansson, K.H., Yi, W. (eds.) Proceedings of the 13th ACM International Conference on Hybrid Systems: Computation and Control, HSCC 2010, Stockholm, Sweden, April 12-15, 2010. pp. 211-220. ACM (2010). https://doi.org/10.1145/1755952.1755983

21. Nguyen, L.V., Kapinski, J., Jin, X., Deshmukh, J.V., Johnson, T.T.: Hyperproperties of real-valued signals. In: Talpin, J., Derler, P., Schneider, K. (eds.) Proceedings of the 15th ACM-IEEE International Conference on Formal Methods and Models for System Design, MEMOCODE 2017, Vienna, Austria, September 29 - October 02, 2017. pp. 104-113. ACM (2017). https://doi.org/10.1145/3127041.3127058

22. Pnueli, A.: The temporal logic of programs. In: 18th Annual Symposium on Foundations of Computer Science, Providence, Rhode Island, USA, 31 October - 1 November 1977. pp. 46-57. IEEE Computer Society (1977). https://doi.org/10.1109/SFCS.1977.32

23. Rosen, K.H., Krithivasan, K.: Discrete mathematics and its applications: with combinatorics and graph theory. Tata McGraw-Hill Education (2012)


24. The European Parliament and the Council of the European Union: Commission Regulation (EU) 2017/1151 (June 2017), http://data.europa.eu/eli/reg/2017/1151/oj

25. Tutuianu, M., Bonnel, P., Ciuffo, B., Haniu, T., Ichikawa, N., Marotta, A., Pavlovic, J., Steven, H.: Development of the world-wide harmonized light duty test cycle (WLTC) and a possible pathway for its introduction in the European legislation. Transportation Research Part D: Transport and Environment 40(Supplement C), 61-75 (2015). https://doi.org/10.1016/j.trd.2015.07.011

26. United Nations: UN Vehicle Regulations - 1958 Agreement, Revision 2, Addendum 100, Regulation No. 101, Revision 3 - E/ECE/324/Rev.2/Add.100/Rev.3 (2013), http://www.unece.org/trans/main/wp29/wp29regs101-120.html

27. Volpato, M., Tretmans, J.: Approximate active learning of nondeterministic input output transition systems. Electron. Commun. Eur. Assoc. Softw. Sci. Technol. 72 (2015). https://doi.org/10.14279/tuj.eceasst.72.1008


Open Access This chapter is licensed under the terms of the Creative Commons
Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/),
which permits use, sharing, adaptation, distribution and reproduction in any medium or
format, as long as you give appropriate credit to the original author(s) and the source,
provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the
chapter’s Creative Commons license, unless indicated otherwise in a credit line to the
material. If material is not included in the chapter’s Creative Commons license and
your intended use is not permitted by statutory regulation or exceeds the permitted
use, you will need to obtain permission directly from the copyright holder.


Estimating Worst-case Resource Usage by 
Resource-usage-aware Fuzzing* 


Liqian Chen¹, Renjie Huang¹,², Dan Luo¹, Chenghu Ma¹,²,
Dengping Wei¹, and Ji Wang¹,²
¹ College of Computer Science, National University of Defense Technology,
Changsha, China
{lqchen,renjiehuang,luodan,machenghu,dpwei,wj}@nudt.edu.cn
² State Key Laboratory of High Performance Computing, Changsha, China


Abstract. Worst-case resource usage provides useful guidance in the
design, configuration and deployment of software, especially when it runs
in a context with a limited amount of resources. Static resource-bound
analysis can provide sound upper bounds on worst-case resource usage
but may yield overly conservative, even unbounded, results. In this paper,
we present a resource-usage-aware fuzzing approach to estimate worst- 
case resource usage. The key idea is to guide the fuzzing process using 
resource-usage amount together with resource-usage relevant coverage. 
Moreover, we leverage semantic patch to make use of static analysis in- 
formation (including control-flow, function-call, etc.) to instrument the 
original program, for the sake of aiding the subsequent fuzzing. We have 
conducted experiments to estimate worst-case resource usage of various 
resources in real-world programs, including heap memory, stack depths, 
sockets, user-defined resources, etc. The preliminary experimental results 
show the promising ability of our approach in estimating worst-case re- 
source usage in real-world programs, compared with two state-of-the-art 
fuzzing tools (AFL and MemLock). 


Keywords: Fuzzing · Resource Usage · Static Analysis


1 Introduction 


Resources refer to any abstractions offered to a process by system calls, apart 
from the process itself. Typical resources in practice include heap/stack memory, 
sockets, file descriptors, threads, database connections, gas consumed in Solidity 
smart contracts, etc. In addition, there exist a variety of user-defined application- 
dependent resources in applications, such as buffers, memory pools, number of 
licenses consumed, etc. Worst-case resource usage provides useful guidance
in the design, configuration and deployment of software, especially when the
software runs with a limited amount of resources, e.g., in the context of modern
cyber-physical systems, mobile systems and IoT devices. Unexpected or
uncontrolled resource usage may degrade program performance, or even lead to
CWE (Common Weakness Enumeration) vulnerabilities (such as
uncontrolled-resource-consumption, file-descriptor-exhaustion, etc.).

Static resource-bound analysis can provide sound upper bounds on worst-case
resource usage but may yield overly conservative, even unbounded, results.
Moreover, most existing static resource-bound analysis techniques [1,2,4,5,8,
9,14] focus on deriving an upper bound on the number of accesses to a given control
location, or simply a bound on the number of iterations of a loop (or recursion).
The programs under analysis are often small-scale, and complex syntactic constructs are
usually abstracted away for simplicity.

In real-world programs, resources are often manipulated via specific APIs
which may involve complex structures. Moreover, the amount of resources used
often depends not only on the parameters of such APIs, but also on the running
system environment. For example, for malloc(n) in C programs, the actual amount
of heap memory allocated depends on the running environment (due to factors
such as alignment, the current first available free slot, etc.) and is to some
extent nondeterministic before execution. The allocation may fail or may allocate
more than n bytes (e.g., due to alignment). In such cases, dynamic analysis
methods are highly desirable.

In this paper, we present a resource-usage-aware fuzzing approach to estimate 
worst-case resource usage. We use resource-usage amount together with resource- 
usage relevant coverage to guide the fuzzing process, so as to generate inputs 
triggering large resource-usage amounts. More precisely, we use a different
definition of branch coverage and additionally use the resource-usage amount to guide the
fuzzing process. Moreover, we also leverage semantic patches [11] to make use of
static analysis information (including control flow, function calls, etc.) to
instrument the original program. Such information helps the subsequent
fuzzing at runtime. We have conducted experiments to estimate worst-case
resource usage of various resources in real-world programs, including heap mem- 
ory, stack depths, sockets, user-defined resources, etc. Preliminary experimental 
results show the promising ability of our approach in estimating worst-case re- 
source usage in real-world programs, compared with two state-of-the-art fuzzing 
tools (AFL and MemLock). 


2 Approach 


In this section, we describe the basic process of our approach (shown in Fig. 1). 


2.1 Static analysis and instrumentation 


Fig. 1. Workflow of resource-usage-aware fuzzing


For the target program, we first identify all program locations (i.e., program
points) of the calls to resource-usage operations in the program. Such
resource-usage operations can be APIs provided by systems or libraries, as well as
application programmer-defined APIs. From the point of view of increasing/decreasing
resource-usage amount, all operations changing resource usage can essentially
be reduced to allocation (i.e., increasing) and deallocation (i.e., decreasing)
operations. To this end, we define two basic modeling functions


• __RAlloc(int n), to model allocating n units of a resource, and
• __RDealloc(int n), to model deallocating n units of a resource.


We will instrument invocations of these two basic modeling functions to explicitly
model the resource usage for each resource-usage operation in the original
program, according to its semantics. For example, to model pFile = fopen(...), we
will instrument (afterwards) __RAlloc(pFile != NULL ? 1 : 0). To model free(p),
we will instrument (beforehand) __RDealloc(malloc_usable_size(p)), wherein the
malloc_usable_size(p) function (which is a C library function) returns the
number of usable bytes in the block pointed to by p. To model the change of
call-stack depth, we instrument __RAlloc(1) and __RDealloc(1) at the
entry and exit (before the return statement) of each function, respectively. Note that
in each resource-usage fuzzing run, we consider only one type of resource. The fuzzing
engine will track the invocations of __RAlloc(int n) and __RDealloc(int n) and
capture their parameters to maintain the current amount and the historical peak
amount of resource usage at runtime.
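
The bookkeeping that the fuzzing engine performs on these calls can be modelled in a few lines. The sketch below is a Python model for illustration only, not the engine's implementation; the class `ResourceTracker`, its method names, and the recursive `depth_demo` function are hypothetical stand-ins for the instrumented modeling functions and an instrumented program.

```python
class ResourceTracker:
    """Maintains the current and the historical peak amount of one resource type."""

    def __init__(self):
        self.current = 0
        self.peak = 0

    def r_alloc(self, n):
        """Model of an allocation of n units."""
        self.current += n
        self.peak = max(self.peak, self.current)

    def r_dealloc(self, n):
        """Model of a deallocation of n units."""
        self.current -= n


tracker = ResourceTracker()


def depth_demo(k):
    """Recursive function instrumented for call-stack depth."""
    tracker.r_alloc(1)        # at function entry
    if k > 0:
        depth_demo(k - 1)
    tracker.r_dealloc(1)      # before returning


depth_demo(4)
print(tracker.current, tracker.peak)
```

After the call returns, the current amount is back at zero while the peak records the deepest nesting that occurred, which is the quantity a worst-case estimate is interested in.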

On the other hand, many functions and basic blocks in the program are useful 
for implementing functionality of the program but not relevant to resource usage. 
Based on this insight, we propose to guide the fuzzing process to cover functions 
and basic blocks that are relevant to resource usage. 


— First, we use the call graph of the target program to identify all functions
that directly or indirectly invoke resource-usage operations.³ Then we
instrument the coverage-label function __covl() before each invocation of
these functions. We use __covl() to identify basic blocks that involve resource
usage, which are later used to define resource-usage-aware coverage.

3 Specifically, to track stack depth, we first collect a set FSet of functions that
directly or indirectly call recursive functions. For the other functions, we compute
the depth of each function from main() according to the call graph, and add to
FSet the top-K percent (e.g., top 30%) of functions with the largest depths.

— Second, for each program block containing invocations of the resource-usage
modeling functions (i.e., __RAlloc(), __RDealloc()), the label function
__covl(), or the exit function exit() (as well as similar functions such as
raise()), we instrument the label function __covl() before the control-flow
branch in which this block is located (e.g., the then branch) and also at the
beginning of the block in the other branch (e.g., the else branch). We
instrument __covl() in a bottom-up manner, i.e., from inner to outer blocks.
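
The first step above, finding every function that directly or indirectly invokes a resource-usage operation, amounts to reverse reachability on the call graph. The following Python sketch illustrates the idea only; the call graph, the function names in it, and the helper resource_relevant_functions are hypothetical and are not taken from the ResFuz implementation.

```python
from collections import deque

def resource_relevant_functions(call_graph, resource_ops):
    """Return the functions that directly or indirectly call a resource operation.

    call_graph maps a function name to the set of functions it calls.
    resource_ops is the set of resource-usage operations (e.g. malloc, accept).
    """
    # Reverse the graph: callee -> set of callers.
    callers = {}
    for caller, callees in call_graph.items():
        for callee in callees:
            callers.setdefault(callee, set()).add(caller)

    # Breadth-first search backwards from the resource operations.
    relevant = set()
    queue = deque(resource_ops)
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, ()):
            if caller not in relevant:
                relevant.add(caller)
                queue.append(caller)
    return relevant

# Hypothetical call graph, loosely modeled on the libtirpc slice.
call_graph = {
    "main": {"svc_run", "print_usage"},
    "svc_run": {"rendezvous_request"},
    "rendezvous_request": {"accept", "makefd_xprt"},
    "makefd_xprt": {"malloc"},
    "print_usage": {"puts"},
}
relevant = resource_relevant_functions(call_graph, {"accept", "malloc"})
```

In this toy graph, print_usage is never reached from a resource operation, so calls to it would not receive a coverage label.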


We use the program transformation tool Coccinelle [12] to automatically
instrument statements invoking the resource-usage modeling functions as well as
the coverage-label function __covl() into the original program. Coccinelle is a
program matching and transformation engine that allows us to write so-called
semantic patches [11], which specify the desired code matches and
transformations. In particular, the transformation engine of Coccinelle is
defined in terms of control flow, and is thus well suited to instrumenting
coverage-label functions on the control-flow branches where resource usage
occurs.


(a) Original Program

     1 static SVCXPRT *makefd_xprt(int fd, u_int sendsize,
     2                             u_int recvsize)
     3 {
     4   ...
     5   if (fd >= FD_SETSIZE) {...; return NULL; }
     6   ...
     7   return (xprt);
     8 }
     9
    10 static bool rendezvous_request(SVCXPRT *xprt)
         if ((sock = accept(xprt->xp_fd, (struct sockaddr *)
              (void *)&addr, &len)) < 0) {...; return false; }
         newxprt = makefd_xprt(sock, r->sendsize, r->recvsize);
         if (newxprt==NULL){
           raise(SIGSEGV); //simulating CVE-2018-14622

(b) Semantic Patch

    @ accept @
    type T;
    expression E;
    identifier id;
    @@
    (
      if ((E = accept(...)) < 0){ ... }
    + __covl();
    + __RAlloc(1);

(c) Instrumented Program

     1 static SVCXPRT *makefd_xprt(int fd, u_int sendsize,
     2                             u_int recvsize)
     3 {
     4   ...
     5   if (fd >= FD_SETSIZE) {...; return NULL; }
     6   ...
     7   return (xprt);
     8 }
     9
    10 static bool rendezvous_request(SVCXPRT *xprt)
         if ((sock = accept(xprt->xp_fd, (struct sockaddr *)
              (void *)&addr, &len)) < 0) {...; return false; }
         __covl();
         __RAlloc(1);
         newxprt = makefd_xprt(sock, r->sendsize, r->recvsize);
         __covl();
         if (newxprt==NULL){
           __covl();
           raise(SIGSEGV); //simulating CVE-2018-14622


Fig. 2. Example illustration


Example illustration. Fig. 2 illustrates the above process on an example
(named libtirpc_slice) extracted from an old version of libtirpc (a
Transport-Independent RPC library for Linux) that contains a known CVE
vulnerability.⁴ The cause of this CVE is that the return value of makefd_xprt()
was not checked in all instances, which could lead to a crash when the server
exhausted the maximum number of available file descriptors. Fig. 2(a) shows the
slice extracted from the original code of libtirpc. Fig. 2(b) shows part of the
semantic patch applied for instrumentation. The instrumented program is shown
in Fig. 2(c). This program consumes socket connections, e.g., by calling accept()
as shown on Line 13 in Fig. 2(a). We use the semantic patch shown in Fig. 2(b)
to instrument the resource-usage modeling function __RAlloc(1) as well as the
coverage-label function __covl() at the program location reached when a
connection is established successfully. The instrumented code is highlighted in
Fig. 2(c).


4 https://ubuntu.com/security/CVE-2018-14622


2.2 Fuzzing loop 


Algorithm 1 Resource-usage Aware Fuzzing


Require: an instrumented program P, and a set of initial seeds I_0
Ensure: (max_res, BuggyS), where max_res is the largest resource usage amount
found, and BuggyS is a set of test cases triggering resource-usage bugs


 1: max_res ← 0
 2: BuggyS ← ∅
 3: SeedQueue ← I_0
 4: while time not expired do
 5:   s ← select(SeedQueue)
 6:   s’ ← mutate(s)
 7:   trace ← execute(s’)
 8:   n_res ← resPeak(trace)
 9:   if n_res > max_res then
10:     max_res ← n_res
11:     SeedQueue ← SeedQueue ∪ s’
12:   else
13:     if find_new_path(trace) then
14:       SeedQueue ← SeedQueue ∪ s’
15:     end if
16:   end if
17:   if trigger_crash(trace) then
18:     BuggyS ← BuggyS ∪ s’
19:   end if
20: end while
21: return (max_res, BuggyS)


Algorithm 1 shows the main procedure of our resource-usage aware fuzzing.
The algorithm first selects an input s from the seed pool SeedQueue and mutates
it to generate a mutant s’. Then, the fuzzer runs the mutant input and monitors
its execution. If the mutant input consumes more resources or leads to new
resource-usage-aware coverage, it is added to the seed pool as an interesting
input. This process is similar to that of traditional coverage-based grey-box
fuzzers (e.g., AFL). The main difference is that the resource-usage aware
fuzzer uses a different definition of branch coverage and adds
resource-consumption guidance to retain interesting inputs. We now give the
details.


Resource-usage aware coverage. Traditional coverage-based grey-box fuzzers
use instrumentation to capture basic block transitions, and log edge coverage
information during runtime. For example, AFL uses a random number to represent
each basic block, and each transition from one basic block to another is marked
by the Exclusive-OR (and right shift) result of the two random values. The
identifier of each transition is treated as an address, and each time the
transition is triggered the hit count at that address is incremented. During
runtime, AFL records edge coverage information, including whether the edge has
been visited, and the count of hits.
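
AFL's edge-identification scheme described above can be sketched in a few lines of Python. This is an illustration of the scheme only, not AFL's actual C instrumentation; the block identifiers below are arbitrary values chosen for the example, and the helper edge_hits is hypothetical.

```python
MAP_SIZE = 1 << 16  # 64 KiB map, matching AFL's default map size

def edge_hits(trace, block_ids):
    """Count hits per edge index for a sequence of executed basic blocks.

    Each edge is identified by cur_id XOR (prev_id >> 1); the shift makes the
    identifier direction-sensitive, so A->B and B->A map to different indices.
    """
    hits = {}
    prev = 0
    for block in trace:
        cur = block_ids[block]
        index = (cur ^ prev) % MAP_SIZE
        hits[index] = hits.get(index, 0) + 1
        prev = cur >> 1
    return hits

# Arbitrary identifiers standing in for the compile-time random numbers.
block_ids = {"A": 0x1234, "B": 0x0F0F}
hits = edge_hits(["A", "B", "A", "B"], block_ids)
```

Here the transition A->B is taken twice and lands on index 0x0615 both times, while B->A lands on the distinct index 0x15B3.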
In this paper, we concentrate only on resource usage in a program, while
many basic blocks in the program are useful for implementing functionality of
the program but not relevant to resource usage. Based on this insight, we log
only transitions between those basic blocks that contain the resource-usage
modeling functions (i.e., __RAlloc(), __RDealloc()), the coverage-label function
__covl(), or the exit() function. For example, consider an execution trace
B_1, B_2, ..., B_{n-1}, B_n in which only B_1 and B_n contain the aforementioned
resource-usage relevant functions. We log it as a transition from B_1 to B_n,
and increase the hit count of this transition. Resource-usage-aware edge
coverage is more fine-grained and sensitive than traditional edge coverage in
distinguishing different resource-usage behaviors.
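
The projection of a trace onto resource-relevant blocks can be illustrated as follows. This is a minimal Python sketch under the assumption that the set of relevant blocks is already known; the helper resource_aware_edges and the block names are hypothetical.

```python
def resource_aware_edges(trace, relevant_blocks):
    """Count transitions between consecutive resource-relevant blocks.

    Blocks that are not resource-relevant are skipped, so a trace
    B1, B2, ..., Bn with only B1 and Bn relevant yields the single edge (B1, Bn).
    """
    hits = {}
    prev = None
    for block in trace:
        if block not in relevant_blocks:
            continue  # irrelevant block: not logged
        if prev is not None:
            edge = (prev, block)
            hits[edge] = hits.get(edge, 0) + 1
        prev = block
    return hits

single = resource_aware_edges(["B1", "B2", "B3", "B4", "B5"], {"B1", "B5"})
looped = resource_aware_edges(
    ["B1", "B2", "B5", "B3", "B1", "B4", "B5"], {"B1", "B5"}
)
```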


Resource-usage amount guidance. When the resource-usage aware fuzzer runs
an input on the instrumented program, it collects not only the resource-usage
aware coverage information, but also the resource-usage amount. The fuzzing
engine maintains two variables, resc_cur and resc_peak, to track the current
amount and the historical peak amount of resource usage, respectively. It
captures the parameters of __RAlloc(n) and __RDealloc(n), and updates the
current amount as well as the historical peak amount of resource usage.
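
The bookkeeping described above is small enough to sketch directly. The Python class below is illustrative only: the names resc_cur and resc_peak follow the text, while the class ResourceTracker and its methods are hypothetical stand-ins for the engine's handling of __RAlloc(n) and __RDealloc(n).

```python
class ResourceTracker:
    """Tracks the current and historical peak amount of one resource type."""

    def __init__(self):
        self.resc_cur = 0
        self.resc_peak = 0

    def r_alloc(self, n):
        # Models __RAlloc(n): n units of the resource are acquired.
        self.resc_cur += n
        if self.resc_cur > self.resc_peak:
            self.resc_peak = self.resc_cur

    def r_dealloc(self, n):
        # Models __RDealloc(n): n units of the resource are released.
        self.resc_cur -= n

tracker = ResourceTracker()
tracker.r_alloc(3)
tracker.r_alloc(2)
tracker.r_dealloc(4)
tracker.r_alloc(1)
```

After this sequence the current amount is 2, while the peak of 5 reached after the second allocation is retained.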


Overall guidance mechanism. As shown in Algorithm 1, after executing an
input s’, we collect the peak resource usage amount of the running trace
through resPeak(trace) (Line 8). If this input leads to more resource usage, it is
added to the seed pool for further mutation (Lines 9-11). Otherwise, if it leads
to new resource-usage aware coverage, it is also added to the seed pool for
further mutation (Lines 13-14). In addition, if the input triggers a crash, it is
added to BuggyS, which collects the test cases triggering resource-usage bugs
(Lines 17-19).
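
To make the control flow of Algorithm 1 concrete, the following Python sketch runs the same loop against a toy target. Everything about the target is an assumption made for illustration: run_target, the byte alphabet, the mutation operators, and the limit of 8 units are hypothetical, and a fixed iteration count replaces the time budget. Coverage is recorded after every execution, which the algorithm leaves unspecified.

```python
import random

ALPHABET = b"AD?"  # 'A' allocates one unit, 'D' releases one, '?' is irrelevant
LIMIT = 8          # hypothetical resource limit that makes the toy target crash

def run_target(data):
    """Toy instrumented program: returns (peak usage, edges, crashed)."""
    cur = peak = 0
    edges = set()
    prev = "entry"
    for byte in data:
        if byte == ord("A"):                 # models __RAlloc(1)
            cur += 1
            block = "alloc"
        elif byte == ord("D") and cur > 0:   # models __RDealloc(1)
            cur -= 1
            block = "dealloc"
        else:
            continue                         # resource-irrelevant block
        peak = max(peak, cur)
        edges.add((prev, block))
        prev = block
    return peak, edges, peak >= LIMIT

def mutate(seed, rng):
    """Insert, overwrite, or delete one byte."""
    data = bytearray(seed)
    choice = rng.randrange(3)
    if choice == 0 or not data:
        data.insert(rng.randrange(len(data) + 1), rng.choice(ALPHABET))
    elif choice == 1:
        data[rng.randrange(len(data))] = rng.choice(ALPHABET)
    else:
        del data[rng.randrange(len(data))]
    return bytes(data)

def fuzz(initial_seeds, iterations, rng):
    max_res = 0
    buggy = set()
    seen_edges = set()
    queue = list(initial_seeds)
    for _ in range(iterations):              # stands in for the time budget
        s = rng.choice(queue)
        s2 = mutate(s, rng)
        peak, edges, crashed = run_target(s2)
        if peak > max_res:                   # Lines 9-11
            max_res = peak
            queue.append(s2)
        elif not edges <= seen_edges:        # Lines 13-14: new coverage
            queue.append(s2)
        seen_edges |= edges
        if crashed:                          # Lines 17-19
            buggy.add(s2)
    return max_res, buggy

max_res, buggy = fuzz([b"A"], 500, random.Random(0))
```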


3 Experiments 


We have implemented our approach in a prototype fuzzer named ResFuz⁵, based
on MemLock [16], which is built on top of AFL [17]. We employ Coccinelle [12]
to conduct program instrumentation.

We conduct preliminary experiments on several open-source programs, including
jasper, openjpeg and yara, which are also part of the benchmark used in [16],
as well as the small example libtirpc_slice explained in Fig. 2. More specifically,
jasper and openjpeg contain many heap resource operations, while yara contains
recursive functions. Moreover, jasper and openjpeg contain many user-defined
application-specific resource-usage operations. For example, jasper uses
operations like jas_malloc() and jas_free() to manage a heap memory pool with a
user-configurable size. Similarly, openjpeg uses operations like opj_malloc()
and opj_free() to manage a specific type of heap memory. The small program
libtirpc_slice contains socket operations, as explained in Sect. 2.1. We compare
ResFuz against two other state-of-the-art fuzzers, namely AFL and MemLock [16].
All our experiments have been performed on machines with an Intel(R) Core(TM)
i9-10940X CPU (3.30GHz) and 32GB of RAM under 64-bit Ubuntu LTS 20.04. We run
each fuzzer for 6 hours (10 minutes for libtirpc_slice) per run, repeat each
experiment 3 times, and report the average as the result.


5 The artifact is available at https://doi.org/10.5281/zenodo.5894821.


Fig. 3. The growth trend of resource usage


Fig. 3 plots the growth of the resource peaks found over time. The vertical
axis shows the peak amount of the resource consumed (heap for jasper and
openjpeg, stack depth for yara, sockets for libtirpc_slice). Fig. 3 shows that
ResFuz outperforms the two baseline fuzzers in finding large resource
consumption in almost all cases (except for jasper shown in Fig. 3(b), for which
MemLock performs slightly better than ResFuz). In particular, as shown in
Figs. 3(d-f), for user-defined resources in openjpeg and jasper as well as
sockets in libtirpc_slice, ResFuz provides much better results than the other two
tools. This is because the guidance mechanism in ResFuz is based on
resource-usage amount and resource-usage aware coverage information, which
accelerates the process of adding inputs that trigger large resource usage to
the seed pool. Note that for these user-defined resources and sockets, MemLock
uses the consumption of the general heap to guide the fuzzing process, while
ResFuz uses the consumption of the specific OPJ heap (in openjpeg), JAS heap
(in jasper), and sockets (in libtirpc_slice), respectively.


4 Related Work 


Using dynamic analysis or fuzzing to find resource-usage relevant bugs has
received much attention in recent years. PREDATOR [3] is an automated black-box
testing tool for detecting and identifying local resource-exhaustion
vulnerabilities in network servers, which computes resource usage profiles for
predicting the utilization of every monitored resource for test inputs.
Radmin [7] learns automata that confine the resource usage of a target program
from its benign executions, and then uses them to detect resource usage
anomalies. Neither PREDATOR nor Radmin uses fuzzing. MemFuzz [6] uses memory
access (rather than memory consumption) instrumentation in addition to branch
coverage to guide evolutionary fuzzing. Recently, researchers have turned their
attention to algorithmic complexity vulnerabilities, with tools such as
SlowFuzz [13], Singularity [15] and PerfFuzz [10]. The basic idea behind them is
to use the number of executed instructions as the guidance for fuzzing.
However, all these works consider time complexity issues.

The work most relevant to our technique is MemLock [16], which uses
memory-usage-guided fuzzing to generate inputs with excessive memory consumption
and trigger uncontrolled memory consumption bugs. MemLock also uses memory
consumption information to guide the fuzzing process and considers two
kinds of memory resources, i.e., stack memory and heap memory. Compared
with MemLock, we consider the usage of general resources, including memory,
file descriptors, socket connections, user-defined resources, etc. Moreover,
MemLock uses the default branch coverage of AFL (which considers transitions
between all basic blocks) to guide the fuzzing process, while our approach
adopts resource-usage-aware coverage (which considers transitions between basic
blocks that are relevant to resource usage). In addition, we employ semantic
patches to make use of the resource-usage relevant call graph and control-flow
graph to conduct instrumentation at the source code level, while MemLock uses
the control-flow graph in the same way as AFL (to define branch coverage) and
uses the call graph only to determine stack memory usage (by instrumenting at
the entry and exit of functions).


5 Conclusion and Future Work 


In this paper, we present a resource-usage-aware fuzzing approach to estimate
worst-case resource usage. It employs resource-usage amount and
resource-usage-aware coverage to guide the fuzzing process, in order to generate
inputs that trigger massive resource usage. Moreover, we employ semantic patches
to make use of resource-usage relevant call graph and control-flow graph
information to conduct instrumentation, which aids the subsequent fuzzing
process. We have conducted experiments to estimate the worst-case usage of
various resources in real-world programs, including heap memory, stack depths,
sockets, user-defined resources, etc. Preliminary experimental results show its
promising ability to estimate worst-case resource usage in real-world programs,
compared with two state-of-the-art fuzzing tools.

For future work, we plan to conduct experiments on more real-world programs
and over more kinds of resources. We also plan to compare against more
state-of-the-art fuzzing tools. Furthermore, we will evaluate our approach in
detecting resource-usage bugs and security-critical vulnerabilities in
real-world programs.
Abstract. We present a novel approach for resolving numerical program
sketches under Boolean and quantitative objectives. The input is
a program sketch, which represents a partial program with missing nu- 
merical parameters (holes). The aim is to automatically synthesize values 
for the parameters, such that the resulting complete program satisfies: 
a Boolean (qualitative) specification given in the form of assertions; and 
a quantitative specification that estimates the number of execution steps 
to termination and which the synthesizer is expected to optimize. 

To address the above quantitative sketching problem, we encode a pro- 
gram sketch as a program family (a.k.a. software product line) and an- 
alyze it by the specifically designed lifted analysis algorithms based on 
abstract interpretation. In particular, we use a combination of forward 
(numerical) and backward (termination) lifted analysis of program fami- 
lies to find the variants (family members) that satisfy all assertions, and 
moreover are optimal with respect to the given quantitative objective. 
Such obtained variants represent “correct & optimal” sketch realizations. 
We present a prototype implementation of our approach within the FAM- 
ILYSKETCHER tool for resolving C sketches with numerical types. We 
have evaluated our approach on a set of benchmarks, and experimental 
results confirm the effectiveness of our approach. 


Keywords: Quantitative program sketching - Software Product Lines - 
Abstract Interpretation 


1 Introduction 


A sketch [29,30] is a partial program with missing numerical expressions called 
holes to be discovered by the synthesizer. Previous approaches for program 
sketching [29,30,17] automatically synthesize integer constant values for the holes 
so that the resulting complete program satisfies Boolean (qualitative) properties 
in the form of assertions. However, the need for considering combined Boolean 
and quantitative properties is prominent in many applications. Still, quantita- 
tive properties have been largely missing from previous approaches for program 
sketching. In particular, there has been no possibility for measuring the
“goodness” of solutions. Boolean properties are used to define minimal
requirements for the synthesized complete programs. Still, there are usually
many different complete programs that satisfy the Boolean properties, and some
of them may be preferred over the others. Therefore, it is important to define
synthesis algorithms that construct complete programs (solutions) that not only
meet the Boolean properties, but are also optimal with respect to a given
quantitative objective [2,6]. This is the so-called quantitative sketching
problem.


In this paper, we use lifted static analysis based on abstract interpretation 
[25] for program families (a.k.a. software product lines) [8] to solve this quan- 
titative sketching problem. The key observation is that all possible sketch real- 
izations constitute a program family, where each numerical hole is represented 
as a numerical feature. A program family describes a set of similar programs as 
variants of some common code base [8]. At compile-time, a variant of a program 
family is derived by assigning concrete values to a set of features (configuration 
options) relevant for it, and only then is this variant compiled or interpreted. 
Program families (often in C) enriched with compile-time configurability by the 
C preprocessor CPP [8,21] are today widely used in open-source projects and 
industry [21]. By using the proposed transformation from program sketches to 
program families, we reduce the quantitative sketching problem to selecting those 
variants (family members) from the corresponding program family that satisfy 
all assertions and are optimal with respect to the given quantitative objective. As 
a quantitative objective we consider here the sufficient preconditions inferred by 
a quantitative termination analysis that estimates the efficiency of a program by 
counting upper-bounds on the number of execution steps to termination. More 
specifically, we use a combination of forward and backward lifted analysis to solve 
this problem. The forward numerical lifted analysis infers numerical invariants 
for all members of a program family, thus finding the “correct” variants that 
satisfy all assertions. Subsequently, the backward termination lifted analysis is 
performed on a sub-family of “correct” variants to infer piecewise-defined rank- 
ing functions, which provide upper-bounds on the number of execution steps to 
termination. The variants with minimal ranking function are reported as optimal 
complete programs that solve the original quantitative sketching problem. 


To find the required variants (i.e., the solution to the quantitative sketching 
problem), we use the specifically designed lifted static analysis algorithms, which 
efficiently analyze all variants of the program family simultaneously, without gen- 
erating any of them explicitly [3,24,22,28,19,11,20,16]. Lifted analysis processes 
the common code base of a program family directly, exploiting the similarities 
among individual variants to reduce analysis effort. It reports precise analysis 
results for all variants of the family. In particular, we use an efficient, abstract 
interpretation-based lifted analysis of program families with numerical features 
[16], where sharing is explicitly possible between equivalent analysis elements 
corresponding to different variants. This is achieved by using a specialized deci- 
sion tree lifted domain [16] that provides a symbolic and compact representation 
of the lifted analysis elements. More precisely, the elements of the lifted domain 
are decision trees, in which decision nodes are labelled with linear constraints 
over features, while leaf nodes belong to an existing single-program analysis
domain (e.g., some numerical domain [25] or the termination domain [31,32]). The
decision trees recursively partition the space of all variants (i.e., the space of
possible combinations of feature values), whereas the program properties at
the leaves provide analysis information corresponding to each partition (i.e., to
those variants that satisfy the constraints along the path to the given leaf node).
This way, the forward (numerical) lifted analysis partitions the given family 
into: “correct”, “incorrect”, and “I don’t know” (inconclusive) sub-families (sets 
of variants) with respect to the given assertions. The backward (termination) 
lifted analysis additionally partitions the “correct” sub-family with respect to 
the estimated number of execution steps to termination. Because of its special 
structure and possibilities for sharing of equivalent analysis results, the decision 
tree-based lifted analyses are able to converge to a solution very fast even for 
program families (sketches) that contain numerical features (holes) with large 
domains, thus giving rise to astronomical search spaces. This is particularly true 
for sketches in which holes appear in (linear) expressions that can be exactly 
represented in the underlying numerical domains used in the decision trees (e.g., 
polyhedra). In those cases, we can design very efficient lifted analysis with ex- 
tended (improved) transfer functions for assignments and tests. 
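
The sharing that decision trees provide can be illustrated with a small Python sketch. It is a simplified picture rather than the lifted domain of [16]: the classes Leaf and Node, the helper functions, and the example tree (one decision node A ≤ B over two features with domain [0, 31]) are hypothetical, and leaf properties are kept as plain strings instead of elements of a numerical domain.

```python
from dataclasses import dataclass
from typing import Callable, Union

@dataclass
class Leaf:
    prop: str  # analysis property shared by all variants reaching this leaf

@dataclass
class Node:
    label: str                     # linear constraint over features, for display
    test: Callable[[dict], bool]   # evaluates the constraint on a configuration
    if_true: "Tree"
    if_false: "Tree"

Tree = Union[Leaf, Node]

def lookup(tree, config):
    """Follow the decisions for one configuration down to its leaf property."""
    while isinstance(tree, Node):
        tree = tree.if_true if tree.test(config) else tree.if_false
    return tree.prop

def count_leaves(tree):
    if isinstance(tree, Leaf):
        return 1
    return count_leaves(tree.if_true) + count_leaves(tree.if_false)

# Hypothetical tree over features A and B.
tree = Node(
    "A <= B",
    lambda k: k["A"] <= k["B"],
    Leaf("y = 0"),
    Leaf("y = A - B"),
)
configs = [{"A": a, "B": b} for a in range(32) for b in range(32)]
```

All 1024 configurations in this example are covered by two leaves, which is the kind of compaction that lets a lifted analysis avoid enumerating variants one by one.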

We have implemented our approach in a prototype program synthesizer, 
called FAMILYSKETCHER [17]. The numerical abstract domains (e.g., intervals, 
octagons, polyhedra) from the APRON library [23] are used as parameters of 
the underlying decision trees. FAMILYSKETCHER calls the Z3 SMT solver [26] 
to solve the optimization problem that represents the given quantitative
objective. We illustrate this approach on the automatic completion of various
numerical C sketches from the Sketch project [29,30], SV-COMP
(https://sv-comp.sosy-lab.org/), and the SyGuS-Competition (https://sygus.org/)
[1]. We compare the performance of our approach against the most popular
sketching tool, Sketch [29,30], and a Brute-Force enumeration approach that
checks all sketch realizations one by one for correctness and optimality.

In summary, this work makes the following contributions: (1) We combine 
forward and backward lifted analyses to resolve numerical program sketches with 
respect to both Boolean and quantitative specifications; (2) We implement our 
approach in the FAMILYSKETCHER tool, which uses numerical domains from
the APRON library as parameters and the Z3 tool for solving the underlying 
(linear) optimization problem; (3) We evaluate our approach and compare its 
performances with the Sketch tool and Brute-Force enumeration approach. 


2 Motivating Examples 


Let us consider the Loop1a sketch taken from SyGuS-Competition [1]:


    main() {
    ①  int x := ??1, y := 0;
       while ② (x > ??2) {
    ③    x := x-1;
    ④    y := y+1; }
    ⑤  assert (y > 2); //assert (y < 8);
    ⑥ }

which contains two numerical holes, denoted by ??1 and ??2. The synthesizer
should replace the holes with constants from ℤ, such that the synthesized
program satisfies the assertion at location ⑤ under all possible inputs.
Moreover, we want to select the most efficient correct program, i.e., the one
that terminates in the minimum number of execution steps.

We transform the Loop1a sketch to a program family, which contains two
numerical features A and B with domains [Min, Max] ⊆ ℤ.¹ Since both holes in
the Loop1a sketch occur in (linear) expressions that can be exactly represented
in numerical domains (e.g., intervals), the Loop1a program family is obtained
by replacing the two holes ??1 and ??2 with the features A and B. The total
number of variants that can be generated from this family is (Max - Min + 1)²,
so that each variant corresponds to one possible sketch realization. We perform
a forward numerical lifted analysis based on decision trees [16] of the Loop1a
program family. The decision tree (lifted numerical invariant) inferred at
location ⑤ is shown in Fig. 1. Notice that the inner nodes of the decision tree in
Fig. 1 are labeled with polyhedral linear constraints defined over the feature
variables A and B, while the leaves are labeled with polyhedral linear
constraints defined over the program and feature variables x, y, A and B. The
edges of decision trees are labeled with the truth value of the decision on the
parent node: we use solid edges for true (i.e., the constraint in the parent
node is satisfied) and dashed edges for false (i.e., the negation of the
constraint in the parent node is satisfied). Note that linear constraints in
decision nodes implicitly take the domains of features into account. For
example, the decision node (A ≤ B) is satisfied when (A ≤ B) ∧ (Min ≤ A ≤ Max)
∧ (Min ≤ B ≤ Max). From the invariant inferred at location ⑤ shown in
Fig. 1, we can see that the given assertion (y > 2) may be valid in the leaf node
that can be reached along the path satisfying the constraint ¬(A ≤ B), i.e.,
(A-B ≥ 1). In fact, (y > 2) holds when the stronger constraint (A-B ≥ 3) is
satisfied. Thus, any variant that satisfies the constraint (A-B ≥ 3) represents
a “correct” solution to the Loop1a sketch. To find a “correct & optimal” solution,


1 Note that Min and Max represent some minimal and maximal representable integers. 
E.g., we may take Min = 0 and Max = 31 for 5-bit sizes of holes. 
we perform a backward termination lifted analysis based on decision trees [13] of
the Loop1a sub-family satisfying (A-B ≥ 3). The decision tree representing the
lifted ranking function of this sub-family at the initial location ① is shown in
Fig. 2.² Notice that the leaf nodes represent affine functions defined over feature
and program variables. We can see that the ranking function is 3A-3B+4. We
call the Z3 solver [26] to solve the following linear optimization problem: find
values for A and B that minimize the ranking function 3A-3B+4 subject to
the constraint (A-B ≥ 3) ∧ (3A-3B+4 ≥ 0). Minimizing this function gives
values for A and B that are desirable according to the quantitative criterion
while satisfying the given assertion. The solution produced by Z3 is A=3 and
B=0, with the minimal objective 13. Therefore, the synthesizer reports this
variant, i.e., the program where ??1=3 and ??2=0, as a “correct & optimal” solution.
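
The numbers in this example can be cross-checked by plain enumeration, which is the Brute-Force baseline rather than the lifted analysis or the Z3 call used by the tool. The Python sketch below assumes Min = 0 and Max = 31 (the 5-bit example from the footnote) and assumes a step-counting convention in which every assignment, every loop test, and the assertion each cost one step; the function run_loop1a is hypothetical.

```python
MIN, MAX = 0, 31  # assumed 5-bit hole domains

def run_loop1a(a, b):
    """Run Loop1a with ??1 = a and ??2 = b; return (final y, steps executed)."""
    steps = 0
    x = a
    steps += 1            # x := ??1
    y = 0
    steps += 1            # y := 0
    while True:
        steps += 1        # loop test (x > ??2)
        if not x > b:
            break
        x -= 1
        steps += 1        # x := x-1
        y += 1
        steps += 1        # y := y+1
    steps += 1            # assert (y > 2)
    return y, steps

pairs = [(a, b) for a in range(MIN, MAX + 1) for b in range(MIN, MAX + 1)]
correct = [(a, b) for a, b in pairs if run_loop1a(a, b)[0] > 2]
best = min(run_loop1a(a, b)[1] for a, b in correct)
optimal = [(a, b) for a, b in correct if run_loop1a(a, b)[1] == best]
```

Under these assumptions the correct variants are exactly those with A-B ≥ 3, their step count equals 3A-3B+4, and the minimum is 13. The minimizer is not unique: every pair with A-B = 3 is optimal, and A=3, B=0 is one of them.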

We consider an alternative sketch of Loop1a, denoted by Loop1b, in which
the assertion at location ⑤ is (y < 8). The numerical invariant inferred at
location ⑤ is the same as for Loop1a, as shown in Fig. 1. However, there are
now two solutions to the assertion (y < 8): (A ≤ B) when the left leaf node is
reached, and (1 ≤ A-B ≤ 7) when the right leaf node is reached. We perform two
backward termination lifted analyses to find optimal solutions for both correct
sub-families: (A ≤ B) and (1 ≤ A-B ≤ 7). The lifted ranking function inferred at
the initial location is given in Fig. 3. The solutions to the given optimization
problem produced by the Z3 solver are: A=0, B=0 with the minimal objective 4 for
the case (A ≤ B); and A=1, B=0 with the minimal objective 7 for the case
(1 ≤ A-B ≤ 7).

Let us consider the Loop2a sketch in Fig. 10. The lifted numerical invariant
inferred at the location of the assertion is shown in Fig. 4. We can see that the
assertion (y > 2) is valid for variants satisfying (A-B ≥ 1) ∧ (3 ≤ A ≤ Max).
The lifted ranking function inferred for this sub-family is shown in Fig. 5. It
represents a piecewise-defined ranking function, since it depends on the value of
the input variable x. To represent piecewise-defined ranking functions
graphically in decision trees, we use rounded rectangles for second-level
decision nodes, which are labelled with linear constraints defined over both
feature and program variables. Thus, they partition the configuration and memory
space, i.e., the possible values of feature and program variables (see Fig. 5).
The obtained “correct & optimal” solution is A=3 and B=0, with the minimal
objective 3 when (x > 10) and -3x+36 when (x ≤ 10). Similarly, we can resolve the
Loop2b sketch, where the assertion (y < 8) is considered. The “correct” variants
satisfy (A-B ≥ 1) ∧ (Min ≤ A ≤ 7), and the “correct & optimal” solution is A=1
and B=0, with the minimal objective 3 when (x > 10) and -3x+36 when (x ≤ 10).
Note that the inferred ranking functions for the “correct” sub-families of Loop2a
and Loop2b in Figs. 5 and 6 do not depend on feature variables, so any “correct”
solution is “optimal” as well.

From the decision trees inferred by performing lifted analyses of our
motivating examples, we can see that the decision tree-based representation uses
only one or two leaf nodes, although there are many variants in total. This
possibility for sharing equivalent analysis information corresponding to
different variants confirms that decision trees are a symbolic and compact
representation of lifted analysis elements. This is the key to obtaining
efficient lifted analyses of program families with large configuration spaces,
and thus to efficiently solving the quantitative sketching problem.


2 Termination analysis is backward, so the final result is reported in the initial location.


3 Transforming Sketches to Program Families 


We now introduce the IMP language that we use to illustrate our work. We
describe two extensions of IMP: $\mathrm{IMP}_{??}$ for writing program sketches,
and $\overline{\mathrm{IMP}}$ for writing program families. Finally, we define the
transformation of sketches to program families and show its correctness.


IMP. We use a simple imperative language, called IMP [27,25], for writing
general-purpose single-programs. Program variables $\mathit{Var}$ are statically
allocated, and the only data type is the set $\mathbb{Z}$ of mathematical
integers. The syntax is:


$$s ::= \texttt{skip} \mid x := ae \mid s;\, s \mid \texttt{if}\ (be)\ \texttt{then}\ s\ \texttt{else}\ s \mid \texttt{while}\ (be)\ \texttt{do}\ s \mid \texttt{assert}\ (be),$$
$$ae ::= n \mid [n, n'] \mid x \mid ae \oplus ae, \qquad be ::= ae \bowtie ae \mid \neg be \mid be \land be \mid be \lor be$$


where $n$ ranges over integers $\mathbb{Z}$, $[n,n']$ over integer intervals, $x$
over program variables $\mathit{Var}$, $\oplus \in \{+,-,*,/\}$, and
$\bowtie \in \{<, \le, =, \ne\}$. Intervals $[n,n']$ denote a random choice of an
integer in the interval. The set of all statements $s$ is denoted by
$\mathit{Stm}$; the set of all arithmetic expressions $ae$ is denoted by
$\mathit{AExp}$; the set of all boolean expressions $be$ is denoted by
$\mathit{BExp}$.

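A program state $\sigma : \Sigma = \mathit{Var} \to \mathbb{Z}$ is a mapping from
program variables to values. The meaning of boolean expressions
$[\![be]\!] : \Sigma \to \mathcal{P}(\{\mathit{true}, \mathit{false}\})$,
arithmetic expressions $[\![ae]\!] : \Sigma \to \mathcal{P}(\mathbb{Z})$, and
statements $[\![s]\!] : \Sigma \to \mathcal{P}(\Sigma)$ are defined by induction on
their structure [27,25]. For example, the meaning of an arithmetic expression
$ae$ is a function from a state to a set of values:


$$[\![n]\!]\sigma = \{n\}, \qquad [\![[n,n']]\!]\sigma = \{n, \ldots, n'\}, \qquad [\![x]\!]\sigma = \{\sigma(x)\},$$
$$[\![ae_0 \oplus ae_1]\!]\sigma = \{ n_0 \oplus n_1 \mid n_0 \in [\![ae_0]\!]\sigma,\ n_1 \in [\![ae_1]\!]\sigma \}$$


We write $[\![s]\!]$ for the set of final states that can be derived by executing
$s$ from some initial input state [27,25].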
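
The set-valued semantics of arithmetic expressions above translates directly into a small evaluator. The Python sketch below is illustrative: the tuple encoding of expressions and the name eval_ae are hypothetical, and division is left out because the text does not say how division by zero is treated.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def eval_ae(ae, state):
    """Return the set of integers an arithmetic expression may evaluate to."""
    kind = ae[0]
    if kind == "const":       # [[n]]σ = {n}
        return {ae[1]}
    if kind == "interval":    # [[[n, n']]]σ = {n, ..., n'}
        return set(range(ae[1], ae[2] + 1))
    if kind == "var":         # [[x]]σ = {σ(x)}
        return {state[ae[1]]}
    if kind == "binop":       # combine every pair of operand values
        _, op, left, right = ae
        return {
            OPS[op](n0, n1)
            for n0 in eval_ae(left, state)
            for n1 in eval_ae(right, state)
        }
    raise ValueError(f"unknown expression kind: {kind}")

# x + [0, 2] in a state where x = 5
result = eval_ae(("binop", "+", ("var", "x"), ("interval", 0, 2)), {"x": 5})
```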
$\mathrm{IMP}_{??}$. The language for sketches $\mathrm{IMP}_{??}$ is obtained by
extending IMP with a basic hole construct, denoted by ??. The numerical hole ??
is a placeholder that the synthesizer must replace with a suitable integer
constant.


$$ae ::= \ldots \mid\ ??$$


Each hole occurrence in a program sketch is assumed to be uniquely labelled as
$??_i$ and has a bounded integer domain $[n,n']$. We will sometimes write
$??_i^{[n,n']}$ to make explicit the domain of a given hole.

Let $H$ be a set of holes in a program sketch. We define a control function
$\phi : \Phi = H \to \mathbb{Z}$ to describe the value of each hole in the sketch.
Thus, $\phi$ fully describes a candidate solution to the sketch. We write
$s^{\phi}$ to describe a candidate solution to the sketch $s$ fully defined by the
control function $\phi$.


$\overline{\mathrm{IMP}}$. Let $\mathbb{F} = \{A_1, \ldots, A_n\}$ be a finite and
totally ordered set of numerical features available in a program family. For each
feature $A \in \mathbb{F}$, $\mathrm{dom}(A) \subseteq \mathbb{Z}$ denotes the set
of possible values that can be assigned to $A$. A valid combination of feature
values represents a configuration $k$, which specifies one variant of a program
family. It is given as a valuation function $k : \mathbb{F} \to \mathbb{Z}$, which
is a mapping that assigns a value from $\mathrm{dom}(A)$ to each feature
$A \in \mathbb{F}$. We assume that only a subset $\mathbb{K}$ of all possible
configurations are valid. An alternative representation of configurations is
based upon propositional formulae. Each configuration $k \in \mathbb{K}$ can also
be represented by a propositional formula:
$(A_1 = k(A_1)) \land \ldots \land (A_n = k(A_n))$. The set of configurations
$\mathbb{K}$ can also be represented as a formula: $\bigvee_{k \in \mathbb{K}} k$.
We define feature expressions, denoted $\mathit{FeatExp}(\mathbb{F})$, as the set
of propositional logic formulas over constraints of $\mathbb{F}$ generated by:


0 ::= true | eF X er |=0 | 81 A 82 |01 V 02, epu=neZ|AECF | exer 


When a configuration k € K satisfies a feature expression 0 € FeatExp(F), we 
write k | 0, where — is the standard satisfaction relation. We write [6] to 
denote the set of configurations from K that satisfy 0, that is, k € [0] iff k H 0. 
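As a small illustration of configurations and of the set of configurations satisfying a feature expression, the following sketch enumerates all valuations of two features and filters them. The helper names and the encoding of a feature expression as a Python predicate are assumptions made for this example only.

```python
from itertools import product

def all_configurations(domains):
    """Enumerate every valuation k : F -> Z for the given finite feature domains."""
    names = sorted(domains)
    return [dict(zip(names, values))
            for values in product(*(domains[n] for n in names))]

def satisfying(theta, configs):
    """Return the configurations k with k |= theta (theta is a Python predicate)."""
    return [k for k in configs if theta(k)]

# Two hypothetical features A and B, each with domain [0, 3].
K = all_configurations({"A": range(0, 4), "B": range(0, 4)})

# Feature expression A - B >= 2.
theta = lambda k: k["A"] - k["B"] >= 2
print(satisfying(theta, K))
```

Explicit enumeration is exponential in the number of features, which is exactly the blow-up that a symbolic representation of configuration sets is meant to avoid.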

The language for program families $\overline{\text{IMP}}$ is obtained by extending IMP with
a new compile-time conditional statement for encoding multiple variants and a
new arithmetic expression that represents a feature variable. The new statement
“#if ($\theta$) $s$ #endif” contains a feature expression $\theta \in FeatExp(\mathbb{F})$ as a presence
condition, such that the statement $s$ is included in the variant corresponding to
a configuration $k \in \mathbb{K}$ only if $\theta$ is satisfied by $k$. The syntax is:

$$s ::= \ldots \mid \texttt{\#if}\ (\theta)\ s\ \texttt{\#endif}, \qquad ae ::= \ldots \mid A \in \mathbb{F}$$

Any other preprocessor conditional construct can be desugared and represented
using only the #if construct. For example, #if ($\theta$) $s_0$ #elif ($\theta'$) $s_1$ #endif is
translated into the following: #if ($\theta$) $s_0$ #endif ; #if ($\neg\theta \land \theta'$) $s_1$ #endif. Note that
feature variables $A \in \mathbb{F}$ can occur in arbitrary expressions in $\overline{\text{IMP}}$, not only in
presence conditions of #if-s as in traditional program families [21,24].

The semantics of $\overline{\text{IMP}}$ has two stages: first, given a configuration $k \in \mathbb{K}$,
compute an IMP single-program without #if-s and $A \in \mathbb{F}$; second, the obtained
program is evaluated using the standard IMP semantics [24]. The first stage
is specified by the projection function $\pi_k$, which recursively pre-processes all
sub-statements and sub-expressions of statements. Hence, $\pi_k(\texttt{skip}) = \texttt{skip}$,
$\pi_k(x := ae) = x := \pi_k(ae)$, $\pi_k(s; s') = \pi_k(s); \pi_k(s')$, $\pi_k(ae \oplus ae') = \pi_k(ae) \oplus \pi_k(ae')$,
and $\pi_k(ae \bowtie ae') = \pi_k(ae) \bowtie \pi_k(ae')$. For “#if ($\theta$) $s$ #endif”, statement $s$ is
included in the variant if $k \models \theta$; otherwise, if $k \not\models \theta$, statement $s$ is removed:³

$$\pi_k(\texttt{\#if}\ (\theta)\ s\ \texttt{\#endif}) = \begin{cases} \pi_k(s) & \text{if } k \models \theta \\ \texttt{skip} & \text{if } k \not\models \theta \end{cases}$$

For a feature $A \in \mathbb{F}$, the projection function $\pi_k$ replaces $A$ with the value
$k(A) \in \mathbb{Z}$, that is, $\pi_k(A) = k(A)$.
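The projection function can be sketched as a recursive traversal of the syntax tree. The tuple encoding of statements and expressions, the function names, and the use of a Python predicate for the presence condition are all assumptions of this sketch; only skip, assignment, sequencing and #if are covered.

```python
def project_ae(ae, k):
    """Replace every feature variable in `ae` by its value under configuration k."""
    kind = ae[0]
    if kind == "feat":
        return ("const", k[ae[1]])
    if kind == "binop":
        return ("binop", ae[1], project_ae(ae[2], k), project_ae(ae[3], k))
    return ae  # constants and program variables are left unchanged

def project(s, k):
    """Compute the single-program variant of statement `s` for configuration k."""
    kind = s[0]
    if kind == "skip":
        return s
    if kind == "assign":
        return ("assign", s[1], project_ae(s[2], k))
    if kind == "seq":
        return ("seq", project(s[1], k), project(s[2], k))
    if kind == "ifdef":
        theta, body = s[1], s[2]
        # Keep the body only when the presence condition holds for k.
        return project(body, k) if theta(k) else ("skip",)
    raise ValueError(f"unknown statement: {s!r}")

# x := A; #if (A >= 2) y := x + A #endif
family = ("seq",
          ("assign", "x", ("feat", "A")),
          ("ifdef", lambda k: k["A"] >= 2,
           ("assign", "y", ("binop", "+", ("var", "x"), ("feat", "A")))))

print(project(family, {"A": 3}))
print(project(family, {"A": 1}))
```

For A = 3 the guarded assignment survives with A replaced by 3, while for A = 1 it is replaced by skip.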


Transformation. We want to transform an input sketch $\hat{s}$ with a set of $m$ holes
$??_1^{[n_1, n_1']}, \ldots, ??_m^{[n_m, n_m']}$ into an output program family $\overline{s}$ with a set of features
$A_1, \ldots, A_m$ with domains $[n_1, n_1'], \ldots, [n_m, n_m']$, respectively. The set of
configurations $\mathbb{K}$ in $\overline{s}$ includes all possible combinations of feature values.

If a hole occurs in a (linear) expression that can be exactly represented in
the underlying numerical abstract domain $\mathbb{D}$, then we can handle the hole in a
more efficient symbolic way by an extended lifted analysis. Given the polyhedra
domain $P$, we say that a hole ?? can be exactly represented in $P$ if it occurs in an
expression of the form: $\alpha_1 x_1 + \ldots + \alpha_i ?? + \ldots + \alpha_n x_n + \beta$, where $\alpha_1, \ldots, \alpha_n, \beta \in \mathbb{Z}$
and $x_1, \ldots, x_n$ are program variables or other hole occurrences. Similarly, we
define that a hole can be exactly represented in the interval $I$ and the octagon
$O$ domains if it occurs in expressions of the form: $\pm ?? + \beta$ and $\pm x \pm ?? + \beta$
(where $\beta \in \mathbb{Z}$, $x$ is a program variable or other hole occurrence), respectively.

We now define rewrite rules for eliminating holes ?? from a program sketch
$\hat{s}$. Let $s[??^{[n, n']}]$ be a basic (non-compound) statement in which the hole $??^{[n, n']}$
occurs as a sub-expression. When the hole $??^{[n, n']}$ occurs in an expression that
can be represented exactly in the numerical domain $\mathbb{D}$, we eliminate ?? using
the symbolic rewrite rule:

$$s[??^{[n, n']}] \rightsquigarrow s[A] \qquad \text{(SR)}$$

Otherwise, if the hole $??^{[n, n']}$ occurs in an expression that cannot be represented
exactly in the numerical domain $\mathbb{D}$, then we use the explicit rewrite rule:

$$s[??^{[n, n']}] \rightsquigarrow \texttt{\#if}\ (A = n)\ s[n]\ \texttt{\#elif} \ldots \texttt{\#elif}\ (A = n'-1)\ s[n'-1]\ \texttt{\#else}\ s[n'] \ldots \texttt{\#endif} \qquad \text{(ER)}$$

In both rules, $A$ is a fresh feature with domain $[n, n']$, and the set of features $\mathbb{F}$
is updated with $A$. We write Rewrite($\hat{s}$) to be the resulting program family obtained
by repeatedly applying rules (SR) and (ER) on a program sketch $\hat{s}$ to saturation.
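The two rewrite rules can be illustrated on the textual form of a basic statement. In this sketch the statement is a plain string containing a single `??`, and the caller states via the hypothetical flag `exact` whether the hole is exactly representable in the chosen domain; deciding that automatically is outside the scope of the example.

```python
def rewrite_hole(stmt, feature, lo, hi, exact):
    """Eliminate the single hole '??' (with domain [lo, hi]) from a basic statement.

    `exact` says whether the hole occurs in an expression that the numerical
    domain can represent exactly; it is supplied by the caller in this sketch.
    """
    if exact:
        # (SR): the hole simply becomes the fresh feature variable.
        return stmt.replace("??", feature)
    if lo == hi:
        # Degenerate domain: only one possible value, no conditional needed.
        return stmt.replace("??", str(lo))
    # (ER): one guarded copy of the statement per value of the hole.
    parts = []
    for n in range(lo, hi):
        keyword = "#if" if n == lo else "#elif"
        parts.append(f"{keyword} ({feature}={n}) {stmt.replace('??', str(n))}")
    parts.append(f"#else {stmt.replace('??', str(hi))}")
    parts.append("#endif")
    return " ".join(parts)

print(rewrite_hole("x := ??", "A", 0, 2, exact=True))
print(rewrite_hole("x := ??*x+10", "A", 0, 2, exact=False))
```

The size of the (ER) output grows linearly with the size of the hole domain, whereas the (SR) output is independent of it.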


Example 1. Reconsider the Loop1a and Loop2a sketches from Section 2. All
holes ?? can be represented exactly in the interval domain, so we use the symbolic
(SR) rule to obtain the program family. Now consider the sketch: int x; while (x >
0) x := ??*x+10. The hole ?? occurs in the non-linear expression ??*x, so it cannot
be represented exactly in any numerical domain $\mathbb{D}$. Thus, we use the explicit (ER)
rule to obtain the program family.


3 Since any $k \in \mathbb{K}$ is a valuation function, we have that either $k \models \theta$ holds or $k \not\models \theta$
(which is equivalent to $k \models \neg\theta$) holds, for any $\theta \in FeatExp(\mathbb{F})$.

The following result establishes the correctness of our transformation. It can
be proved by structural induction on statements and expressions.


Theorem 1. Let $\hat{s}$ be a sketch with holes $??_1, \ldots, ??_n$, $\phi$ be a control function,
and $\hat{s}^{\phi}$ be a candidate solution of $\hat{s}$. Let $\overline{s}$ = Rewrite($\hat{s}$) be a program family, in
which features $A_1, \ldots, A_n$ correspond to holes $??_1, \ldots, ??_n$. We define a
configuration $k \in \mathbb{K}$, s.t. $k(A_i) = \phi(??_i)$ for $1 \leq i \leq n$. Then, we have: $[\![\hat{s}^{\phi}]\!] = [\![\pi_k(\overline{s})]\!]$.


4 Decision Tree-based Lifted Analyses 


In the context of program families, lifting means taking a static analysis that
works on IMP single-programs, and transforming it into an analysis that works
on $\overline{\text{IMP}}$ program families, without preprocessing them. In this work, we will use
lifted versions of the (forward) numerical analysis [25] and the (backward) ter- 
mination analysis [31] from the abstract interpretation framework [9]. They will 
be used to infer numerical invariants and piecewise-defined ranking functions in 
all program locations. We work with lifted analyses based on the lifted domain of 
decision trees [16], in which the leaf nodes belong to an existing single-program 
domain (e.g., a numerical or termination domain) and decision nodes are linear 
constraints over feature variables. This way, we encapsulate the set of config- 
urations K into decision nodes where each top-down path represents a subset 
of configurations from K, and we store in each leaf node the analysis property 
generated from the variants corresponding to the given configurations. 


4.1 Abstract domain for decision nodes 


The domain of decision nodes $\mathbb{C}_{\mathbb{D}}$ is the finite set of linear constraints defined
over a set of variables $V = \{X_1, \ldots, X_k\}$. $\mathbb{C}_{\mathbb{D}}$ is constructed using the numerical
domain $\mathbb{D}$ (see Section 4.2) by mapping a conjunction of constraints from $\mathbb{D}$
to a finite set of constraints in $\mathcal{P}(\mathbb{C}_{\mathbb{D}})$. We assume the set of variables $V =
\{X_1, \ldots, X_k\}$ to be a finite and totally ordered set, such that the ordering is
$X_1 > \ldots > X_k$. We impose a total order $<_{\mathbb{C}_{\mathbb{D}}}$ on $\mathbb{C}_{\mathbb{D}}$ to be the lexicographic
order on the coefficients $\alpha_1, \ldots, \alpha_k$ and constant $\alpha_{k+1}$ of the linear constraints:

$$(\alpha_1 \cdot X_1 + \ldots + \alpha_k \cdot X_k + \alpha_{k+1} \geq 0) <_{\mathbb{C}_{\mathbb{D}}} (\alpha_1' \cdot X_1 + \ldots + \alpha_k' \cdot X_k + \alpha_{k+1}' \geq 0)$$
$$\iff \exists j > 0.\ \forall i < j.\ (\alpha_i = \alpha_i') \land (\alpha_j < \alpha_j')$$

The negation of linear constraints is formed as: $\neg(\alpha_1 X_1 + \ldots + \alpha_k X_k + \beta \geq 0) =
-\alpha_1 X_1 - \ldots - \alpha_k X_k - \beta - 1 \geq 0$. For example, the negation of $X - 3 \geq 0$
is $-X + 2 \geq 0$. To ensure canonical representation of decision trees, a linear
constraint $c$ and its negation $\neg c$ cannot both appear as decision nodes. Thus, we
only keep the largest constraint with respect to $<_{\mathbb{C}_{\mathbb{D}}}$ between $c$ and $\neg c$.
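The negation and the lexicographic order on constraints can be checked with a few lines of code. Here a constraint over integer variables is encoded, purely as an assumption of this sketch, as a tuple of coefficients followed by the constant; Python's built-in tuple comparison is then exactly the lexicographic order.

```python
def negate(c):
    """Negate the integer constraint a1*X1 + ... + ak*Xk + b >= 0.

    `c` is the tuple (a1, ..., ak, b); over the integers, the negation
    a1*X1 + ... + b < 0 is equivalent to -a1*X1 - ... - b - 1 >= 0.
    """
    *coeffs, beta = c
    return tuple(-a for a in coeffs) + (-beta - 1,)

def holds(c, values):
    """Evaluate constraint `c` on concrete integer values for X1, ..., Xk."""
    *coeffs, beta = c
    return sum(a * v for a, v in zip(coeffs, values)) + beta >= 0

def canonical(c):
    """Keep the larger of c and its negation w.r.t. the lexicographic order."""
    return max(c, negate(c))

c = (1, -3)                  # X - 3 >= 0
print(negate(c))             # (-1, 2), i.e. -X + 2 >= 0
print(canonical(negate(c)))  # (1, -3)
```

Because exactly one of `c` and `negate(c)` holds for any integer valuation, storing only the canonical one loses no information: the other is represented by the false branch of the decision node.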


4.2 Abstract domain for leaf nodes 


We assume the existence of a single-program abstract domain $\mathbb{A}$ defined over a
set of variables $V = \{X_1, \ldots, X_k\}$. The domain $\mathbb{A}$ is equipped with sound
operators for concretization $\gamma_{\mathbb{A}}$, ordering $\sqsubseteq_{\mathbb{A}}$, join $\sqcup_{\mathbb{A}}$, meet $\sqcap_{\mathbb{A}}$, bottom $\bot_{\mathbb{A}}$, top $\top_{\mathbb{A}}$,
widening $\nabla_{\mathbb{A}}$, and narrowing $\Delta_{\mathbb{A}}$, as well as sound transfer functions for tests
(boolean expressions) $\text{FILTER}_{\mathbb{A}}$, forward assignments $\text{F-ASSIGN}_{\mathbb{A}}$, and backward
assignments $\text{B-ASSIGN}_{\mathbb{A}}$. More specifically, $\text{FILTER}_{\mathbb{A}}(a : \mathbb{A}, be : BExp)$
returns an abstract element from $\mathbb{A}$ obtained by restricting $a$ to satisfy the test
$be$; $\text{F-ASSIGN}_{\mathbb{A}}(a : \mathbb{A}, x := ae : Stm)$ returns an updated version of $a$ by abstractly
evaluating $x := ae$ in it; whereas $\text{B-ASSIGN}_{\mathbb{A}}(b : \mathbb{A}, x := ae : Stm)$ returns an
abstract element from $\mathbb{A}$ from which the abstract element $b$ is guaranteed to hold
after evaluating $x := ae$. Note that $a$ in $\text{F-ASSIGN}_{\mathbb{A}}$ is an invariant in the initial
location of $x := ae$ that needs to be propagated forward, while $b$ in $\text{B-ASSIGN}_{\mathbb{A}}$ is an
invariant in the final location of $x := ae$ that needs to be propagated backwards.
We will sometimes write $\mathbb{A}_V$ to explicitly denote the set of variables $V$ over
which $\mathbb{A}$ is defined. In this work, we will use domains $\mathbb{A}_{Var}$, $\mathbb{A}_{\mathbb{F}}$, and $\mathbb{A}_{Var \cup \mathbb{F}}$.


For the forward numerical analysis, we will instantiate $\mathbb{A}$ with some of the
known numerical domains $(\mathbb{D}, \sqsubseteq_{\mathbb{D}})$, such as Intervals $(I, \sqsubseteq_I)$ [9,25], Octagons
$(O, \sqsubseteq_O)$ [25], and Polyhedra $(P, \sqsubseteq_P)$ [25]. The elements of $I$ are intervals of
the form: $\pm X \geq \beta$, where $X \in V$, $\beta \in \mathbb{Z}$; the elements of $O$ are conjunctions of
octagonal constraints of the form $\pm X_1 \pm X_2 \geq \beta$, where $X_1, X_2 \in V$, $\beta \in \mathbb{Z}$;
while the elements of $P$ are conjunctions of polyhedral constraints of the form
$\alpha_1 X_1 + \ldots + \alpha_k X_k + \beta \geq 0$, where $X_1, \ldots, X_k \in V$, $\alpha_1, \ldots, \alpha_k, \beta \in \mathbb{Z}$.


For the backward termination analysis, we will instantiate $\mathbb{A}$ with the
termination decision tree domain $\mathbb{T}^{T}(\mathbb{C}_{\mathbb{D}_{Var \cup \mathbb{F}}}, \mathcal{F}_A)$, also written $\mathbb{T}^{T}$ for short,
introduced by Urban and Miné [31,32], where $\mathbb{C}_{\mathbb{D}_{Var \cup \mathbb{F}}}$ is the domain for decision nodes
and $\mathcal{F}_A$ is the domain of affine functions for leaf nodes. The elements of $\mathcal{F}_A$ are:
$\{\bot_{\mathcal{F}}, \top_{\mathcal{F}}\} \cup \{f : \mathbb{Z}^{|Var \cup \mathbb{F}|} \to \mathbb{N} \mid f(x_1, \ldots, x_n) = m_1 x_1 + \ldots + m_n x_n + q\}$, where
$f \in \mathcal{F}_A$ is a natural-valued function of program and feature variables representing
an upper bound on the number of steps to termination; the element $\bot_{\mathcal{F}}$
represents potential non-termination; and $\top_{\mathcal{F}}$ represents the lack of information to
conclude. The leaf nodes belonging to $\mathcal{F}_A \setminus \{\bot_{\mathcal{F}}, \top_{\mathcal{F}}\}$ and $\{\bot_{\mathcal{F}}, \top_{\mathcal{F}}\}$ represent
defined and undefined leaf nodes, respectively. A termination decision tree $t' \in \mathbb{T}^{T}$
is: either a leaf node $\ll f \gg$ with $f \in \mathcal{F}_A$, or $[\![c' : tl', tr']\!]$, where $c' \in \mathbb{C}_{\mathbb{D}_{Var \cup \mathbb{F}}}$
(denoted by $t'.c$) is the smallest constraint with respect to $<_{\mathbb{C}_{\mathbb{D}}}$ appearing in the tree
$t'$, $tl'$ (denoted by $t'.l$) is the left subtree of $t'$ representing its true branch, and
$tr'$ (denoted by $t'.r$) is the right subtree of $t'$ representing its false branch. The
path along a decision tree establishes a set of program states and a set of
configurations (those that satisfy the encountered constraints), and leaf nodes represent
partially-defined ranking functions over the given program states and
configurations. The transfer function $\text{B-ASSIGN}_{\mathbb{T}^{T}}(t', x := ae)$ substitutes the arithmetic
expression $ae$ for the variable $x$ in linear constraints occurring within decision
nodes of $t'$ and in functions occurring in leaf nodes of $t'$, whereas the transfer
function $\text{FILTER}_{\mathbb{T}^{T}}(t', be)$ generates a set of linear constraints $J$ from test $be$ and
restricts $t'$ such that all paths satisfy the constraints from $J$. Finally, both
transfer functions increment the constant $q$ of defined functions $f \in \mathcal{F}_A \setminus \{\bot_{\mathcal{F}}, \top_{\mathcal{F}}\}$ in
all leaf nodes of $t'$.
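The effect of a backward assignment on a single defined leaf can be sketched as follows. The encoding of an affine function as a coefficient dictionary plus a constant, and the name `b_assign_leaf`, are assumptions made for illustration; the sketch ignores decision nodes and the check that the result stays natural-valued.

```python
def b_assign_leaf(f, q, x, ae_coeffs, ae_const):
    """Back-propagate the affine ranking function f + q through x := ae.

    f         -- dict mapping variable names to integer coefficients
    q         -- integer constant of the ranking function
    ae_coeffs -- coefficients of the affine right-hand side ae
    ae_const  -- constant of the affine right-hand side ae
    Returns the new (coefficients, constant) after substituting ae for x
    and counting one more execution step.
    """
    m_x = f.get(x, 0)
    result = {v: m for v, m in f.items() if v != x}
    for v, a in ae_coeffs.items():
        result[v] = result.get(v, 0) + m_x * a
    result = {v: m for v, m in result.items() if m != 0}
    return result, q + m_x * ae_const + 1

# Illustrative input: f = 3x - 3B + 2 propagated backwards through x := A.
print(b_assign_leaf({"x": 3, "B": -3}, 2, "x", {"A": 1}, 0))
# Propagating f = x + 1 backwards through x := x - 1.
print(b_assign_leaf({"x": 1}, 1, "x", {"x": 1}, -1))
```

The added 1 reflects that the assignment itself is one more step before termination.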


We refer to [25,31] for a precise definition of all operations and transfer
functions of the interval, octagon, polyhedra, and termination decision tree domains.

Algorithm 1: $\text{ASSIGN}_{\mathbb{T}}(t, x := ae, C)$ when $vars(ae) \subseteq Var$

1 if isLeaf($t$) then return $\ll \text{ASSIGN}_{\mathbb{A}}(t, x := ae) \gg$;
2 else return $[\![t.c : \text{ASSIGN}_{\mathbb{T}}(t.l, x := ae, C \cup \{t.c\}),\ \text{ASSIGN}_{\mathbb{T}}(t.r, x := ae, C \cup \{\neg t.c\})]\!]$;


4.3 Decision tree lifted domains 


We now define the decision tree lifted domain $\mathbb{T}(\mathbb{C}_{\mathbb{D}_{\mathbb{F}}}, \mathbb{A}_{Var \cup \mathbb{F}})$, written $\mathbb{T}$ for
short, for representing lifted analysis properties [16]. A decision tree $t \in \mathbb{T}(\mathbb{C}_{\mathbb{D}}, \mathbb{A})$
is either a leaf node $\ll a \gg$ with $a \in \mathbb{A}$, or $[\![c : tl, tr]\!]$, where decision node $c \in \mathbb{C}_{\mathbb{D}}$
(denoted by $t.c$) is the smallest constraint with respect to $<_{\mathbb{C}_{\mathbb{D}}}$ appearing in the
tree $t$, $tl$ (denoted by $t.l$) is the left subtree of $t$ representing its true branch,
and $tr$ (denoted by $t.r$) is the right subtree of $t$ representing its false branch.
The path along a decision tree establishes the set of configurations (those that
satisfy the encountered constraints), and the leaf nodes represent their analysis
properties.


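Operations. The concretization function $\gamma_{\mathbb{T}}$ of a decision tree $t \in \mathbb{T}(\mathbb{C}_{\mathbb{D}}, \mathbb{A})$
returns $\gamma_{\mathbb{A}}(a)$ for $k \in \mathbb{K}$ that satisfies the set $C \in \mathcal{P}(\mathbb{C}_{\mathbb{D}})$ of constraints accumulated
along the top-down path to the leaf node $a \in \mathbb{A}$.

The binary operations rely on the algorithm for tree unification [16,31], which
finds a common labelling of decision nodes of two trees $t_1$ and $t_2$. Note that the
tree unification does not lose any information. All binary operations, including
ordering $\sqsubseteq_{\mathbb{T}}$, join $\sqcup_{\mathbb{T}}$, meet $\sqcap_{\mathbb{T}}$, widening $\nabla_{\mathbb{T}}$, and narrowing $\Delta_{\mathbb{T}}$, are performed
leaf-wise on the unified decision trees. For example, the ordering $t_1 \sqsubseteq_{\mathbb{T}} t_2$ of two
unified decision trees $t_1$ and $t_2$ is defined recursively as:

$$\ll a_1 \gg\ \sqsubseteq_{\mathbb{T}}\ \ll a_2 \gg\ =\ a_1 \sqsubseteq_{\mathbb{A}} a_2, \qquad [\![c : tl_1, tr_1]\!] \sqsubseteq_{\mathbb{T}} [\![c : tl_2, tr_2]\!] = (tl_1 \sqsubseteq_{\mathbb{T}} tl_2) \land (tr_1 \sqsubseteq_{\mathbb{T}} tr_2)$$

The top is: $\top_{\mathbb{T}} = \ll \top_{\mathbb{A}} \gg$, while the bottom is: $\bot_{\mathbb{T}} = \ll \bot_{\mathbb{A}} \gg$.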
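The leaf-wise ordering on unified trees can be sketched with a toy tree encoding. The tuple representation, the string labels of decision nodes, and the choice of plain intervals as leaves are assumptions of this example; tree unification itself is not implemented and is taken as a precondition.

```python
def interval_leq(a, b):
    """Interval inclusion: a = (lo, hi) is contained in b."""
    return b[0] <= a[0] and a[1] <= b[1]

def tree_leq(t1, t2, leaf_leq):
    """Ordering of two already unified decision trees, applied leaf-wise.

    A tree is ("leaf", a) or ("node", c, true_branch, false_branch).
    """
    if t1[0] == "leaf" and t2[0] == "leaf":
        return leaf_leq(t1[1], t2[1])
    if t1[0] == "node" and t2[0] == "node" and t1[1] == t2[1]:
        return (tree_leq(t1[2], t2[2], leaf_leq) and
                tree_leq(t1[3], t2[3], leaf_leq))
    raise ValueError("trees must be unified before comparison")

t1 = ("node", "A-3>=0", ("leaf", (0, 5)), ("leaf", (1, 2)))
t2 = ("node", "A-3>=0", ("leaf", (0, 10)), ("leaf", (0, 2)))
print(tree_leq(t1, t2, interval_leq))  # True
print(tree_leq(t2, t1, interval_leq))  # False
```

Join, meet, widening and narrowing follow the same recursion, with the leaf comparison replaced by the corresponding leaf operation.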


Transfer functions. We define lifted transfer functions for tests, (forward and
backward) assignments (ASSIGN), and #if-s [16]. We consider two types of
tests $be$ and assignments $x := ae$: when $be$ and $ae$ contain only program variables;
and when $be$ and $ae$ contain both feature and program variables.

Transfer function $\text{ASSIGN}_{\mathbb{T}}$⁴ for handling an assignment $x := ae$ in the input
tree $t$, when the set of variables in $ae$ is $vars(ae) \subseteq Var$, is implemented by
applying $\text{ASSIGN}_{\mathbb{A}}$ leaf-wise, as shown in Algorithm 1. Similarly, transfer function
$\text{FILTER}_{\mathbb{T}}$ for handling tests $be \in BExp$, when $vars(be) \subseteq Var$, is implemented
by applying $\text{FILTER}_{\mathbb{A}}$ leaf-wise.

Transfer function $\text{ASSIGN}_{\mathbb{T}}$ for $x := ae$, when $vars(ae) \subseteq Var \cup \mathbb{F}$, is given in
Algorithm 2. It accumulates into the set $C \in \mathcal{P}(\mathbb{C}_{\mathbb{D}_{\mathbb{F}}})$ (initialized to $\mathbb{K}$) constraints
encountered along the paths of the decision tree $t$ (Line 2), up to the leaf nodes
in which assignment is performed by $\text{ASSIGN}_{\mathbb{A}_{Var \cup \mathbb{F}}}$. That is, we first merge
constraints from the leaf node $t$ defined over $Var \cup \mathbb{F}$ and constraints from
decision nodes $C \in \mathcal{P}(\mathbb{C}_{\mathbb{D}_{\mathbb{F}}})$ defined over $\mathbb{F}$, by using the $\uplus_{Var \cup \mathbb{F}}$ operator, and then
we apply $\text{ASSIGN}_{\mathbb{A}_{Var \cup \mathbb{F}}}$ on the obtained result (Line 1).

Algorithm 2: $\text{ASSIGN}_{\mathbb{T}}(t, x := ae, C)$ when $vars(ae) \subseteq Var \cup \mathbb{F}$

1 if isLeaf($t$) then return $\ll \text{ASSIGN}_{\mathbb{A}_{Var \cup \mathbb{F}}}(t \uplus_{Var \cup \mathbb{F}} C,\ x := ae) \gg$;
2 else return $[\![t.c : \text{ASSIGN}_{\mathbb{T}}(t.l, x := ae, C \cup \{t.c\}),\ \text{ASSIGN}_{\mathbb{T}}(t.r, x := ae, C \cup \{\neg t.c\})]\!]$;

Transfer function $\text{FILTER}_{\mathbb{T}}$ for test $be$, when $vars(be) \subseteq Var \cup \mathbb{F}$, is described
by Algorithm 3. Similarly to $\text{ASSIGN}_{\mathbb{T}}$ in Algorithm 2, it accumulates the
constraints along the paths in a set $C \in \mathcal{P}(\mathbb{C}_{\mathbb{D}_{\mathbb{F}}})$ up to the leaf nodes, and applies
$\text{FILTER}_{\mathbb{A}_{Var \cup \mathbb{F}}}$ on an abstract element obtained by merging constraints in the
leaf node and in $C$ (Line 2). The obtained result $a'$ is a new leaf node, and
additionally $a'$ is projected on feature variables using the $\upharpoonright_{\mathbb{F}}$ operator to generate
a new set of constraints $J$ that is added to the given path to $a'$ by using the
function RESTRICT [16] (Lines 3-5). The function isRedundant($J$, $C$) checks if
the constraints from $J$ are redundant with respect to the set $C$.

Algorithm 3: $\text{FILTER}_{\mathbb{T}}(t, be, C)$ when $vars(be) \subseteq Var \cup \mathbb{F}$

1 if isLeaf($t$) then
2     $a' = \text{FILTER}_{\mathbb{A}_{Var \cup \mathbb{F}}}(t \uplus_{Var \cup \mathbb{F}} C,\ be)$;
3     $J = a' \upharpoonright_{\mathbb{F}}$;
4     if isRedundant($J$, $C$) then return $\ll a' \gg$;
5     else return RESTRICT($\ll a' \gg$, $C$, $J \setminus C$);
6 else return $[\![t.c : \text{FILTER}_{\mathbb{T}}(t.l, be, C \cup \{t.c\}),\ \text{FILTER}_{\mathbb{T}}(t.r, be, C \cup \{\neg t.c\})]\!]$;

4 Note that ASSIGN is an abbreviation for both F-ASSIGN and B-ASSIGN.
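The way path constraints are accumulated and then merged into a leaf before an assignment can be illustrated with a heavily simplified sketch. Everything here is an assumption made for the example: decision nodes have the restricted form "feature >= bound", the accumulated constraints are kept as one interval per feature, leaves are maps from program variables to intervals, and only the assignment x := A of a feature to a variable is handled.

```python
def assign_feature(t, x, feature, ranges):
    """Lifted forward assignment x := feature on a toy decision tree.

    t      -- ("leaf", {var: (lo, hi)}) or ("node", (feat, bound), tl, tr),
              where the node tests feat >= bound
    ranges -- feature intervals implied by the constraints seen so far
    """
    if t[0] == "leaf":
        # Merge path information into the leaf, then assign.
        leaf = dict(t[1])
        leaf[x] = ranges[feature]
        return ("leaf", leaf)
    _, (feat, bound), tl, tr = t
    lo, hi = ranges[feat]
    true_ranges = {**ranges, feat: (max(lo, bound), hi)}       # feat >= bound
    false_ranges = {**ranges, feat: (lo, min(hi, bound - 1))}  # feat <= bound - 1
    return ("node", (feat, bound),
            assign_feature(tl, x, feature, true_ranges),
            assign_feature(tr, x, feature, false_ranges))

tree = ("node", ("A", 3), ("leaf", {"y": (0, 0)}), ("leaf", {"y": (0, 0)}))
print(assign_feature(tree, "x", "A", {"A": (0, 7)}))
```

Although both leaves start out identical, the assignment makes them differ, because each leaf sees only the feature values allowed on its own path.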

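Finally, the transfer function for #if directives is defined as:

$$[\![\texttt{\#if}\ (\theta)\ s\ \texttt{\#endif}]\!]_{\mathbb{T}}\, t = [\![s]\!]_{\mathbb{T}}(\text{FILTER}_{\mathbb{T}}(t, \theta, \mathbb{K})) \sqcup_{\mathbb{T}} \text{FILTER}_{\mathbb{T}}(t, \neg\theta, \mathbb{K})$$

where $[\![s]\!]_{\mathbb{T}}(t)$ is the transfer function for $s$ and $\text{FILTER}_{\mathbb{T}}(t, \theta, \mathbb{K})$ is defined by
Algorithm 3 since $\theta$ contains only features. The transfer function for assertions is:
$[\![\texttt{assert}(be)]\!]_{\mathbb{T}}\, t = \text{FILTER}_{\mathbb{T}}(t, be, \mathbb{K})$.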

After applying transfer functions, the obtained decision trees may contain 
some redundancy that can be exploited to further compress them. We use several 
optimizations [16]. E.g., if constraints on a path to some leaf are unsatisfiable, 
we eliminate that leaf node; if a decision node contains two same subtrees, then 
we keep only one subtree and we also eliminate the decision node, etc. 
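One of the compression steps mentioned above, removing a decision node whose two subtrees are identical, can be sketched as a bottom-up pass. The tuple encoding of trees is an assumption of this example, and the pruning of unsatisfiable paths is not shown since it needs a constraint solver.

```python
def compress(t):
    """Remove decision nodes whose true and false subtrees are identical.

    A tree is ("leaf", a) or ("node", c, true_branch, false_branch).
    """
    if t[0] == "leaf":
        return t
    _, c, tl, tr = t
    tl, tr = compress(tl), compress(tr)
    # If both branches agree, the test on c carries no information.
    return tl if tl == tr else ("node", c, tl, tr)

tree = ("node", "A-3>=0",
        ("node", "B-1>=0", ("leaf", (0, 5)), ("leaf", (0, 5))),
        ("leaf", (0, 5)))
print(compress(tree))  # ('leaf', (0, 5))
```

Compressing bottom-up matters: the outer node only becomes redundant after its inner subtree has collapsed to a leaf.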


4.4 Decision tree-based lifted analysis 


Operations and transfer functions of $\mathbb{T}(\mathbb{C}_{\mathbb{D}}, \mathbb{D})$ and $\mathbb{T}(\mathbb{C}_{\mathbb{D}}, \mathbb{T}^{T})$ are used to perform
the numerical and termination lifted analysis of program families, respectively.
The numerical lifted analysis derived from $\mathbb{T}(\mathbb{C}_{\mathbb{D}}, \mathbb{D})$, written as $\mathbb{T}^{\mathbb{D}}$ for short, is a
pure forward analysis that infers numerical invariants in all program locations.




We define the analysis function $[\![s]\!]_{\mathbb{T}^{\mathbb{D}}}\, t$ that takes as input a decision tree $t$
corresponding to the initial location of statement $s$, and outputs a decision tree
over-approximating the numerical invariant in the final location of $s$. The input
decision tree $t^{in}_{\mathbb{D}}$ at the initial location of a program family has only one leaf
node $\top_{\mathbb{D}_{Var \cup \mathbb{F}}}$ and decision nodes that define the set $\mathbb{K}$. Lifted invariants are
propagated forward from the initial location towards the final location taking
assignments, #if-s, and tests into account with widening and narrowing around
while-s. We apply delayed widening [9], which means that we start extrapolating
by widening after a fixed number of iterations of a loop are analyzed explicitly.
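Delayed widening can be illustrated on a single interval variable. The sketch below analyses the loop head of `while (x <= guard_hi) x := x + step`; the function names, the restriction to one variable and to this loop shape, and the single narrowing step at the end are simplifying assumptions, not the analyzer's actual iteration strategy.

```python
INF = float("inf")

def join(a, b):
    return (min(a[0], b[0]), max(a[1], b[1]))

def leq(a, b):
    return b[0] <= a[0] and a[1] <= b[1]

def widen(a, b):
    """Standard interval widening: unstable bounds jump to infinity."""
    return (a[0] if a[0] <= b[0] else -INF,
            a[1] if a[1] >= b[1] else INF)

def narrow(a, b):
    """Standard interval narrowing: refine only the infinite bounds."""
    return (b[0] if a[0] == -INF else a[0],
            b[1] if a[1] == INF else a[1])

def loop_head_invariant(init, guard_hi, step, delay):
    """Invariant at the head of `while (x <= guard_hi) x := x + step`."""
    inv, iteration = init, 0
    while True:
        lo, hi = inv[0], min(inv[1], guard_hi)        # filter by the loop test
        new = init if lo > hi else join(init, (lo + step, hi + step))
        if leq(new, inv):                             # post-fixpoint reached
            return narrow(inv, new)
        # Iterate explicitly for `delay` rounds, then start widening.
        inv = widen(inv, new) if iteration >= delay else join(inv, new)
        iteration += 1

print(loop_head_invariant((0, 0), 99, 1, delay=2))  # (0, 100)
```

With `delay=2` the upper bound is widened to infinity on the third round and recovered as 100 by the narrowing step; a large delay reaches the same invariant by explicit iteration, at the cost of about a hundred rounds.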


Similarly, we define the termination lifted analysis derived from $\mathbb{T}(\mathbb{C}_{\mathbb{D}}, \mathbb{T}^{T})$,
written as $\mathbb{T}^{\mathbb{T}}$ for short. It is a pure backward analysis that infers ranking
functions in all program locations. We define the analysis function $[\![s]\!]_{\mathbb{T}^{\mathbb{T}}}\, t$ that takes
as input a decision tree $t$ in the final location of statement $s$, and outputs a
decision tree over-approximating the ranking function in the initial location of
$s$. The input decision tree $t^{in}_{\mathbb{T}}$ at the final location of a program family has
only one leaf node 0 (zero function) and decision nodes that define the set $\mathbb{K}$.
Lifted ranking functions are propagated backward from the final towards the
initial location.


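We establish correctness of the lifted analysis based on $\mathbb{T}(\mathbb{C}_{\mathbb{D}}, \mathbb{A})$ by showing
that it produces results identical to those of the Brute-Force enumeration approach
based on the domain $\mathbb{A}$. Let $[\![\overline{s}]\!]_{\mathbb{T}}$ denote the transfer function of statement $\overline{s}$
of $\overline{\text{IMP}}$ in $\mathbb{T}(\mathbb{C}_{\mathbb{D}}, \mathbb{A})$, while $[\![s]\!]_{\mathbb{A}}$ denotes the transfer function of statement $s$ of
IMP in $\mathbb{A}$. Given $t \in \mathbb{T}(\mathbb{C}_{\mathbb{D}}, \mathbb{A})$, we denote by $\pi_k(t) \in \mathbb{A}$ the leaf node of tree $t$
that corresponds to the variant $k \in \mathbb{K}$.


Theorem 2. $\pi_k([\![\overline{s}]\!]_{\mathbb{T}}(t)) = [\![\pi_k(\overline{s})]\!]_{\mathbb{A}}(\pi_k(t))$ for all $k \in \mathbb{K}$.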


Example 2. In Figs. 7 and 8 we depict decision trees at locations @) and @)
inferred by performing (forward) numerical analysis based on the domain $\mathbb{T}(\mathbb{C}_{\mathbb{D}}, P)$
of the Loop1a program family (see Section 2). In order to enforce convergence
of the analysis, we apply the widening operator at the loop head, i.e. at the
location Œ) before the while test. We can see how the invariant at location ©)
shown in Fig. 1 is inferred from the invariant at location @).


Subsequently, we perform a (backward) lifted termination analysis based on
the domain $\mathbb{T}(\mathbb{C}_{\mathbb{D}}, \mathbb{T}^{T})$ of the Loop1a sub-family satisfying $(A - B \geq 3)$. Lifted
decision trees inferred at locations @) and @) are shown in Figs. 9 and 2,
respectively. We can see how by back-propagating the tree at location Œ), denoted
t@ (see Fig. 9), via assignments y := 0 and x := A at location Œ), we obtain
the tree at location ©), denoted t@) (see Fig. 2). The transfer function
$\text{B-ASSIGN}_{\mathbb{T}}$(t@), x := A) will generate the tree t@ where x is replaced with A.
The new decision node $(A \geq B+1)$ and the leaf node with ranking function 2 are
eliminated from t@) since they are redundant with respect to $(A - B \geq 3)$.

Fig. 7: Invariant at loc. @) of Loop1a.
Fig. 8: Invariant at loc. @) of Loop1a.
Fig. 9: Ranking fun. at loc. @) of Loop1a.


5 Synthesis Algorithm 


We can now solve the quantitative sketching problem using lifted analysis algo- 
rithms. More specifically, we delegate the effort of conducting an effective search 
of all possible sketch realizations to an efficient lifted static analyzer, which 
combines the forward numerical and the backward termination analyses. 

The synthesis algorithm SYNTHESIZE($\hat{s}$ : Stm) for solving a sketch $\hat{s}$ is
given in Algorithm 4. First, we transform the program sketch $\hat{s}$ into a program
family $\overline{s}$ = Rewrite($\hat{s}$) (Line 1). Then, we call function $[\![\overline{s}]\!]_{\mathbb{T}^{\mathbb{D}}}\, t^{in}_{\mathbb{D}}$ to perform the
forward numerical lifted analysis of $\overline{s}$ (Line 2). The inferred decision tree $t_F$ at the final
location of $\overline{s}$ is analyzed by function FINDCORRECT (Line 3) to find the sets of
variants for which non-$\bot_{\mathbb{D}}$ and non-$\top_{\mathbb{D}}$ leaf nodes are reachable. The variants
for which a $\bot_{\mathbb{D}}$ leaf node is reachable are “incorrect” with respect to the given
assertions, whereas the variants for which a $\top_{\mathbb{D}}$ leaf node is reachable are “I
don’t know” (inconclusive). For each non-$\bot_{\mathbb{D}}$ and non-$\top_{\mathbb{D}}$ leaf node, we generate
the set of variants $\mathbb{K}' \subseteq \mathbb{K}$ that satisfy the encountered linear constraints along
the top-down path to that leaf node as well as the given assertions. For each such
“correct” set of variants $\mathbb{K}'$, we perform the backward termination lifted analysis
$[\![\overline{s}]\!]_{\mathbb{T}^{\mathbb{T}}}\, t^{in}_{\mathbb{T}}$ (Line 6). The inferred decision tree $t_B$ is analyzed by function FINDOPTIMAL
(Line 7). It calls the Z3 solver [26] to solve the following optimization problem:
find a model that minimizes the value of ranking functions $t' \in \mathbb{T}^{T}$, such that
the linear constraints along the top-down paths to those leaf nodes are satisfied.
More formally, given a decision tree $t \in \mathbb{T}(\mathbb{C}_{\mathbb{D}}, \mathbb{T}^{T})$, we define the function $\phi[C]t$
that finds a set of pairs $(k, t')$ consisting of valid configurations $k \in \mathbb{K}$ and the
corresponding ranking function $t' \in \mathbb{T}^{T}$ as follows:

$$\phi[C](\ll t' \gg) = \{(k, t') \mid k \in \mathbb{K},\ k \models C\}, \qquad \phi[C]([\![c : tl, tr]\!]) = \phi[C \cup \{c\}](tl) \cup \phi[C \cup \{\neg c\}](tr)$$

The optimization problem is the following. Given a decision tree $t_B \in \mathbb{T}(\mathbb{C}_{\mathbb{D}}, \mathbb{T}^{T})$
inferred at the initial location of $\overline{s}$, find a configuration $k \in \mathbb{K}$ such that the
corresponding ranking function is minimal. That is, $\min_{k \in \mathbb{K}} \{t' \mid (k, t') \in \phi[\mathbb{K}]\, t_B\}$.

The configuration k with the minimal ranking function found by Z3 is re- 
ported as a “correct and optimal” solution to the quantitative sketching problem. 
Theorem 3. SYNTHESIZE($\hat{s}$) is correct and terminates.

Algorithm 4: SYNTHESIZE($\hat{s}$ : Stm)

1 $\overline{s}$ = Rewrite($\hat{s}$);
2 $t_F = [\![\overline{s}]\!]_{\mathbb{T}^{\mathbb{D}}}\, t^{in}_{\mathbb{D}}$;
3 $C$ = FINDCORRECT($t_F$);
4 while $C \neq \emptyset$ do
5     $\mathbb{K}'$ = $C$.remove();
6     $t_B = [\![\overline{s}]\!]_{\mathbb{T}^{\mathbb{T}}}\, t^{in}_{\mathbb{T}}$;
7     sol.insert(FINDOPTIMAL($t_B$))
8 return sol
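The collection of (configuration, ranking function) pairs and the final minimization can be sketched as follows. This is not the Z3-based implementation described above: as a simplifying assumption, configurations are enumerated explicitly, decision nodes are Python predicates over a configuration, leaf ranking functions are Python functions of the configuration only (or None for undefined leaves), and the tree used in the example is made up.

```python
from itertools import product

def phi(t, configs):
    """Pair every configuration with the leaf it reaches in tree t."""
    if t[0] == "leaf":
        return [(k, t[1]) for k in configs]
    _, c, tl, tr = t
    return (phi(tl, [k for k in configs if c(k)]) +
            phi(tr, [k for k in configs if not c(k)]))

def find_optimal(t, configs):
    """Return (k, value) with the minimal ranking-function value, or None."""
    defined = [(k, f(k)) for k, f in phi(t, configs) if f is not None]
    return min(defined, key=lambda pair: pair[1]) if defined else None

# Hypothetical features A and B with domain [0, 7].
K = [{"A": a, "B": b} for a, b in product(range(8), repeat=2)]

# Made-up tree: ranking function 3A - 3B + 3 when A - B >= 3, undefined otherwise.
tree = ("node", lambda k: k["A"] - k["B"] >= 3,
        ("leaf", lambda k: 3 * k["A"] - 3 * k["B"] + 3),
        ("leaf", None))

best_k, best_value = find_optimal(tree, K)
print(best_k, best_value)
```

An optimizing solver replaces the explicit enumeration by a symbolic search over the linear path constraints, which is what keeps this step independent of the size of the hole domains.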


6 Evaluation 


We evaluate our approach for program sketching by comparing it with the 
Brute-Force enumeration approach and the popular Sketch tool. 


Implementation. We have implemented our synthesis algorithm for quantitative
program sketching [14] within the FAMILYSKETCHER tool [17]. It uses the lifted
decision tree domain $\mathbb{T}(\mathbb{C}_{\mathbb{D}}, \mathbb{A})$, where $\mathbb{A}$ is instantiated either to the numerical
abstract domain $\mathbb{D}$ or to the termination decision tree domain $\mathbb{T}^{T}$. The abstract
operations and transfer functions for the numerical domain $\mathbb{D}$ (intervals, octagons,
polyhedra) are provided by the APRON library [23], while those for the termination
decision tree domain are provided by the FuncTion tool [32]. The tool is written
in OCaml and consists of around 7K LOC. The current tool provides limited
support for arrays, pointers, struct and union types. The only basic data type
is mathematical integers, which is sufficient for our evaluation.

Within the FAMILYSKETCHER, we have also implemented the Brute-Force 
enumeration approach that analyzes all variants (sketch realizations), one by 
one, using the single-program domains D and TT. 


Experiment setup and Benchmarks. All experiments are executed on a 64-bit
Intel® Core™ i7-1165G7 CPU@2.80GHz, VM Lubuntu 20.10, with 8 GB memory,
and we use a timeout value of 300 seconds. All times are reported as the average
over five independent executions. We report times needed for the actual analysis
task to be performed. The implementation is available from [14]:
https://zenodo.org/record/5898643#.YhJLRejMLIU. We compare our approach with the program
sketching tool Sketch version 1.7.6, which uses SAT-based counterexample-guided
inductive synthesis [30,29], and with the Brute-Force enumeration approach.
The evaluation is performed on several C numerical sketches collected from the
Sketch project [30,29], SV-COMP (https://sv-comp.sosy-lab.org/), and the
SyGuS-Competition [1]. We use the following benchmarks: Loop1a and Loop1b
(Sec. 2), Loop2a and Loop2b (Fig. 10), LoopCond (Fig. 11), NestedLoop
(Fig. 12), vmcai2004 (Fig. 13).


Performance Results. Table 1 shows the results of synthesizing our benchmarks.
Note that Sketch reports only one “correct” solution for each sketch, which
does not have to be “optimal” with respect to the given quantitative objective.
FAMILYSKETCHER and Brute-Force use the polyhedra domain as parameter.

    void main() {
      int x;
      int y := ??1;
      while (x < 10) {
        if (y > ??2) x := x+1;
        else x := x-1; }
      assert (y > 2);
      //assert (y < 8);
    }

Fig. 10: Loop2a.

    void main(unsigned int x) {
      int y := 0;
      while (x > 0) {
        x := x-1;
        if (y < ??) y := y+1;
        else y := y-1; }
      assert (y > 1);
    }

Fig. 11: LoopCond.

    void main(unsigned int x) {
      int s := 0, y := ??1;
      int x0 := x, y0 := y;
      while (x > 0) {
        x := x-1;
        while (y > ??2) {
          y := y-1; s := s+1; }
      }
      assert (s > x0+y0);
    }

Fig. 12: NestedLoop.

    void main() {
      int x := ??1, y := 0;
      while (x > 0) {
        x := -??2*x+10;
        y := y+1;
      }
      assert (y < 2);
    }

Fig. 13: vmcai2004.

The Loop1a and Loop1b sketches are handled symbolically by the (SR) rule.
Thus, our approach does not depend on the sizes of hole domains. FAMILYSKETCHER
terminates in (around) 0.016 sec for Loop1a and in 0.026 sec for Loop1b. In
contrast, Brute-Force and Sketch do depend on the sizes of holes. Sketch
terminates in 37.74 sec (resp., 2.44 sec) for 16-bit holes for Loop1a
(resp., Loop1b). It times out for bigger sizes of Loop1a. Sketch often reports
“correct & optimal” solutions for both sketches. Similarly, our tool can handle
Loop2a and Loop2b symbolically in 0.060 sec and 0.047 sec. However, Sketch
cannot resolve them, since it uses 8 unrollments of the loop by default. If the loop
is unrolled 11 times, Sketch terminates but often reports the empty solution.

The LoopCond sketch contains one hole that can be handled symbolically
by the (SR) rule. FAMILYSKETCHER has similar running times for all domain sizes,
reporting the solution ?? > 2 and ranking function 4x+8. In contrast, Sketch
resolves this example only if the loop is unrolled as many times as the size
of the hole and inputs (e.g., 32 times for 5 bits). Hence, Sketch’s performance
declines as the size of the hole grows, and it times out for 16 bits.

The NestedLoop sketch contains two holes that can be handled symbolically
by the (SR) rule. FAMILYSKETCHER terminates in (around) 0.126 sec for all sizes
of holes. The “correct” solution is $(??_1 - ??_2 \geq 0) \land (Min \leq ??_2 \leq 1)$, while the
“correct & optimal” solution is $(??_1 = ??_2 = 0)$ with ranking function 13. On
the other hand, Brute-Force takes 65.03 sec for 5-bit holes and times
out for larger sizes, while Sketch cannot resolve this benchmark.

The vmcai2004 sketch contains two holes. The first one, $??_1$, is handled
symbolically by the (SR) rule, while the second one, $??_2$, is handled explicitly by the
(ER) rule. The performance of FAMILYSKETCHER depends on the size of $??_2$. The decision tree
inferred in the location before the assertion contains one leaf node for each
possible value of feature B (features A and B represent $??_1$ and $??_2$). The sub-family
of “correct” solutions is: $(1 \leq A \leq Max) \land (B \geq 10)$, while the “correct &
optimal” solution is $(A = 1) \land (B = 10)$ with ranking function 6. Sketch scales better in
this case, reporting one “correct” (but not “optimal”) solution. However,
FAMILYSKETCHER still outperforms the Brute-Force approach.

Table 1: Performance results of FAMILYSKETCHER vs. Sketch vs. Brute-Force.
FAMILYSKETCHER and Brute-Force use the Polyhedra domain. All times in sec.

           |               5 bits                  |               6 bits                  |               16 bits
Bench.     | FAMILYSKETCHER  Sketch   Brute-Force  | FAMILYSKETCHER  Sketch   Brute-Force  | FAMILYSKETCHER  Sketch   Brute-Force
Loop1a     | 0.016           0.192    4.66         | 0.017           0.197    21.33        | 0.017           37.74    timeout
Loop1b     | 0.026           0.203    4.77         | 0.026           0.216    21.38        | 0.027           2.44     timeout
Loop2a     | 0.060           0.200    8.66         | 0.060           0.202    42.81        | 0.061           0.348    timeout
Loop2b     | 0.047           0.203    8.45         | 0.047           0.205    36.04        | 0.049           0.521    timeout
LoopCond   | 0.042           0.207    1.19         | 0.042           0.209    2.56         | 0.043           timeout  timeout
NestedLoop | 0.126           timeout  65.03        | 0.126           timeout  timeout      | 0.128           timeout  timeout
vmcai2004  | 4.69            0.192    5.12         | 15.52           0.229    19.12        | timeout         0.292    timeout


Discussion. In summary, we can see that FAMILYSKETCHER often outperforms
Sketch, especially in the case of sketches that can be handled symbolically by the (SR)
rule. However, for sketches with holes that need to be handled by the (ER) rule, the
performance of our tool declines, which is the consequence of the need to explicitly
consider all values of those holes. Even in this case, FAMILYSKETCHER
scales better than Brute-Force. This is due to the fact that Brute-Force
compiles and executes the fixed-point iterative algorithm once for each variant, while
our approach does it once for the whole family, and there are still possibilities for
sharing. Moreover, FAMILYSKETCHER reports the “correct & optimal” solution,
while Sketch reports the first found “correct” solution.


Threats to validity. The current tool has only limited support for arrays, pointers,
struct and union types. However, the above features are largely orthogonal to
the solution proposed here. In particular, these features complicate the
semantics of single-programs and the implementation of domains for leaf nodes, but have
no impact on the semantics of variability-specific constructs. We perform lifted
analysis of relatively small benchmarks. However, the focus of our approach is
to combat the realization space blow-up of sketches, not their LOC size. So, we
expect to obtain similar or better results for larger benchmarks. Although we
analyze a relatively small set of benchmarks, we expect the results to carry over
to other benchmarks.


7 Related Work 


The proposed program sketcher uses numerical abstract domains as parame- 
ters, so it can be applied for synthesizing programs with numerical data types. 
The existing widely-known sketching tool Sketch [29,30], which uses SAT-based 
counterexample-guided inductive synthesis, is more general and especially suited 
for synthesizing bit-manipulating programs. However, Sketch reasons about 
loops by unrolling them, so is very sensitive to the degree of unrolling. Our 
approach being based on abstract interpretation does not have this constraint, 


Quantitative Program Sketching using Lifted Static Analysis 119 


since we use the widening extrapolation operator to handle unbounded loops 
and an infinite number of execution paths in a sound way. This is stronger than 
fixing a priori a bound on the number of iterations of loops as in the Sketch 
tool. Moreover, Sketch may need several iterations to converge reporting only 
one solution. On the other hand, our approach needs only one iteration to per- 
form lifted analysis reporting several, and very often all, “correct” solutions. 
This is the key for applying our approach to solve the quantitative sketching 
problem. Another work for solving a quantitative sketching problem is proposed
by Chaudhuri et al. [6]. The quantitative optimum they consider is that the
expected output value on probabilistic inputs is minimal [5]. They use smoothed
proof search and probabilistic analysis to implement this approach in the
FERMAT tool built on top of Sketch. In contrast, in this work the quantitative
optimum we consider is that the worst-case behavior of the program is minimal. 

Recently, several works have been proposed that solve the sketching
synthesis problem using product line analysis and verification algorithms. Ceska
et al. [4] use a counterexample-guided abstraction refinement technique for
analyzing product lines to resolve probabilistic PRISM sketches. The work [17] uses
a (forward) numerical lifted analysis based on abstract interpretation to resolve 
numerical sketches. We extend here this approach by considering the more gen- 
eral quantitative sketching problem, where we additionally employ a (backward) 
termination lifted analysis to find a solution that is not only “correct” but also 
“optimal” to the given quantitative objective. 

Several lifted analyses based on abstract interpretation have been proposed
recently [24,11,12,16,18,15,13] for analyzing program families with #if-s.
Midtgaard et al. [24] have proposed the lifted tuple-based analysis, while the work
[11,12] improves the tuple representation by using lifted binary decision diagram 
(BDD) domains. They are applied to program families with only Boolean fea- 
tures. Subsequently, the lifted decision tree domain has been proposed to handle 
program families with both Boolean and numerical features [16,18], as well as 
dynamic program families where features can change during run-time [15]. The 
above lifted analyses are forward and infer numerical invariants, while a back- 
ward termination analysis for inferring ranking functions is proposed in [13]. 

Decision-tree abstract domains have been used in the abstract interpretation
community recently [10,7,32]. Segmented decision tree abstract domains have
enabled path dependent static analysis [10,7]. Their elements contain decision 
nodes that are determined either by values of program variables [10] or by the 
if conditions [7], whereas the leaf nodes are numerical properties. Urban and 
Miné [31,32] use decision tree abstract domains to prove program termination. 


8 Conclusion 


In this work, we proposed a new approach for synthesis of program sketches,
such that the resulting program satisfies the combined Boolean and quantitative
specifications. We have shown that both reasoning tasks can be accomplished
using a combination of forward and backward lifted analysis. We experimentally
demonstrated the effectiveness of our approach on a variety of C benchmarks.


Abstract. Probabilistic programming aims to open the power of Bayesian
reasoning to software developers and scientists, but identification of problems 
during inference and debugging are left entirely to the developers and typically 
require significant statistical expertise. A common class of problems when 
writing probabilistic programs is the lack of convergence of the probabilistic 
programs to their posterior distributions. 

We present SixthSense, a novel approach for predicting probabilistic pro- 
gram convergence ahead of run and its application to debugging convergence 
problems in probabilistic programs. SixthSense’s training algorithm learns a 
classifier that can predict whether a previously unseen probabilistic program 
will converge. It encodes the syntax of a probabilistic program as motifs 
— fragments of the syntactic program paths. The decisions of the classifier 
are interpretable and can be used to suggest the program features that con- 
tributed significantly to program convergence or non-convergence. We also 
present an algorithm for augmenting a set of training probabilistic programs 
that uses guided mutation. 

We evaluated SixthSense on a broad range of widely used probabilistic pro- 
grams. Our results show that SixthSense features are effective in predicting 
convergence of programs for given inference algorithms. SixthSense obtained 
Accuracy of over 78% for predicting convergence, substantially above the 
state-of-the-art techniques for predicting program properties Code2Vec and 
Code2Seq. We show the ability of SixthSense to guide the debugging of
convergence problems: it pinpoints the causes of non-convergence significantly
better than Stan’s built-in warnings.


Keywords: Probabilistic Programming · Debugging · Machine Learning


1 Introduction 


Probabilistic programs (PP) express complicated Bayesian models as simple computer
programs. They are used in various domains [22, 38, 44, 54], including important
applications such as epidemic modeling [23] and single-cell genomics [42]. Probabilistic
languages extend conventional languages with constructs for sampling from probability
distributions (prior), conditioning on data, and probabilistic queries, such as the
distribution reshaped by conditioning on the data (posterior) [26]. Probabilistic programming
systems (PP systems) compile the programs and compute the results using an efficient
inference algorithm, while hiding the intricate details of inference. Most practical 
inference algorithms are non-deterministic and approximate. For instance, Markov 
Chain Monte Carlo (MCMC) algorithms [28, 40, 48] run a probabilistic program 
multiple times (each of which is referred to as an iteration) to sample data points from 
the posterior distribution. They drive today’s popular PP systems, such as Stan [9]. 

MCMC algorithms have a nice theoretical property: in the limit, the samples 
they generate come from the correct posterior distribution. But, in practice, a user 
can only execute the algorithm for a finite time budget and hence needs to fine-tune 
the algorithms to balance between quality of inference and execution time. This 
complicates development: the programmer needs to write the program in a way 
that interacts well with the algorithm and select some parameters specific for the 
inference algorithms. For instance, inference may fail to properly initialize, silently 
produce inaccurate results, or generate non-independent samples from the posterior 
distribution. Even identifying and afterward resolving these challenges currently 
requires significant statistical expertise. 

An important property for successful inference is convergence, since non-convergence
is often a cause of inaccurate (or wrong) results. Convergence means that the samples
generated by the inference algorithm represent the target distribution. While there exist
metrics for convergence (e.g., the Gelman-Rubin diagnostic [25]) in the statistics literature,
a comprehensive study of which model features could cause non-convergence is lacking.
A data-driven understanding of the causes could help developers debug
non-convergence issues without requiring expert knowledge. Moreover,
the existing convergence diagnostics are not predictive: they cannot be determined
ahead of time, i.e., without running the program. Building a prediction model for
convergence ahead of time would save the time needed to run programs (often minutes
or more). It would also enable a faster program debug/update cycle.


1.1 SixthSense 


We present SixthSense, the first approach for identifying convergence problems in
probabilistic programs ahead-of-run. SixthSense adopts a learning approach: it trains
a classifier that can, for a previously unseen probabilistic program and its data, predict
whether the program will converge in a specified number of steps (for a given threshold
of the Gelman-Rubin diagnostic). The decisions of the classifier are interpretable and can
be used to suggest which program features lead to the convergence or non-convergence
of the program.

To train such a classifier, SixthSense needs to overcome several challenges that are 
beyond the big-code techniques studied for conventional languages [4, 5, 31, 37, 47]. 
First, probabilistic programs are small (20-100 lines of code) compared to conventional 
programs but their execution is complicated, with conditioning statements for data 
and non-standard semantics that performs Bayesian inference. Second, due to their 
relative novelty, there are few publicly-available probabilistic programs that can be 
used for training. Finally, we should be able to interpret why the programs are
predicted to converge or not to converge, in order to guide developers in debugging
the non-convergence issues.

Representing Structural, Data, and Runtime Features: To learn a classifier, we
embed the syntactic and semantic program features in a numerical vector. To encode 
program structure, we observe that many snippets of code in probabilistic programs 
form patterns (sampling from distributions, hierarchical models, relations between 
variables) that may repeat within the single program or across programs. We identify 
those patterns as motifs — fragments of probabilistic program code, consisting of several 
adjacent abstract-syntax-tree nodes (e.g., neighboring statements or expressions). 
SixthSense learns the set of features from the subset of motifs it identifies in the 
code. It groups together similar motifs by calculating a low-dimensional representation 
of the motifs using randomized discrete projections [8]. This way, it can balance 
the accuracy of prediction and the size of the learned models. We also engineered 
a set of data features (e.g., means, variances) and the runtime features — diagnostics 
from early warmup iterations that the inference algorithms compute as they exe- 
cute. These features cannot be learned by the approaches that focus on static code 
features [4, 5, 31, 47]. 
Mutation-Based Program Generation: We present a novel technique based on 
program and data mutations that produces a diverse set of probabilistic programs 
with a good balance between converging and non-converging programs, with the goal 
to augment the training set. Our technique takes a set of seed programs as input, 
analyzes them and applies a set of pre-defined mutations which aim to change the 
semantics of generated programs. To obtain better diversity, our algorithm identifies
(via locality-sensitive hashing [6]) and discards any mutant that is too similar to
one that was generated before.
Interpretable Predictor Results: For problem diagnosis and debugging of proba- 
bilistic programs, it is important to be able to interpret why the algorithm predicted 
non-convergence. Our learning algorithm leverages random forests for this task. It 
relates the likely cause of non-convergence to specific statements or expressions 
in the program code. 


1.2 Results 


In this work, we learn the classifiers for convergence of three popular classes of 
probabilistic programs: Regression, Time Series, and Mixture Models. We obtained 
166 seed programs, across the three classes, from an open source repository of Stan 
programs [52]. For each class, SixthSense generated more than 10,000 mutants. We 
train our classifiers for multiple thresholds of the convergence score (Gelman-Rubin 
diagnostic) to evaluate the sensitivity of our classifiers. 

Our evaluation shows the effectiveness of SixthSense in predicting convergence 
of probabilistic programs compared to two state-of-the-art learning algorithms for 
conventional code: Code2Vec [5] and Code2Seq [4]. We measure the prediction quality 
via Accuracy (ratio of sum of True Positives and True Negatives to total tested 
programs), Precision (ratio of True Positives to total classified as Positives) and Recall 
(ratio of True Positives to total actual Positives). Here True Positive is a program that 
is predicted to converge and it indeed converges; the others are defined analogously. 
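
As an illustration of the three metrics defined above, the following minimal sketch computes them from boolean labels (True meaning "converges"). The function name `classification_metrics` and the sample labels are hypothetical and are not taken from the SixthSense implementation.

```python
def classification_metrics(predicted, actual):
    """Return (accuracy, precision, recall) for boolean labels; True = converges."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)          # predicted converge, converges
    tn = sum(1 for p, a in pairs if not p and not a)  # predicted not, does not
    fp = sum(1 for p, a in pairs if p and not a)      # predicted converge, does not
    fn = sum(1 for p, a in pairs if not p and a)      # predicted not, converges
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Made-up labels for five programs, for illustration only.
predicted = [True, True, False, False, True]
actual = [True, False, False, True, True]
accuracy, precision, recall = classification_metrics(predicted, actual)
print(accuracy, precision, recall)
```
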

SixthSense obtains an average Accuracy score across the three model classes of
78% for convergence prediction (with almost equally high precision and recall).
SixthSense, with just code features, outperforms Code2Vec [5] by 8 percentage points on
average and Code2Seq [4] by 5 percentage points on average (for a tight convergence
threshold). Moreover, Accuracy scores increase to over 83% when
adding runtime features obtained after just the first 10-200 samples from the warmup
stage of the inference algorithm (which is less than 10% of its run-time). SixthSense
also has higher precision for all model classes, and recall higher than Code2Vec but
similar to Code2Seq. SixthSense’s prediction time is less than a second and the model
size is modest: less than 20 MB, which is 25-37% smaller than Code2Vec/Code2Seq.

We further demonstrate, by studying 40 non-converging programs, that SixthSense 
can pinpoint the locations in the code that cause non-convergence for 29 programs. In 
contrast, Stan’s runtime warnings point to non-convergence causes in only 5 programs. 


1.3 Contributions 


We highlight the main contributions of this paper: 


• SixthSense System¹. SixthSense is a system for learning to predict convergence
of probabilistic programs that aids programmers in pinpointing and understanding
the sources of convergence problems in PPs.

• Predicting convergence of probabilistic programs. We present the first
approach for learning predictors for convergence of probabilistic programs based
on encoding the structure of probabilistic programs using code motifs.

• Program generation for training set augmentation. We present a new
mutation algorithm for augmenting the training set with PPs that have diverse structural
and runtime characteristics.

• Experimental evaluation. We show that SixthSense predicts convergence for
three popular classes of programs, with higher accuracy, precision, and recall than
two state-of-the-art approaches. In our case study, SixthSense helps pinpoint the likely
cause of non-convergence for 29 out of 40 non-converging programs, compared to
5 programs for which Stan’s runtime warnings help.


2 Example 


We describe how SixthSense computes motifs, trains the predictor and demonstrate 
how we can use it to guide the debugging of probabilistic programs. Figure 1 shows 
two variants of a Mixture model in Stan. A Mixture Model is a probabilistic model 
that assumes that each observed data point comes from one out of N independent 
sub-distributions of values. Each sub-distribution has an associated probability (called 
mixing ratio) of being chosen. 

The programs A and B in Figures 1(a) and 1(b) have several (unknown) parameters:
mean mu and variance sigma of the normal sub-distribution; theta is the mixing
ratio of the sub-distributions and p/ is an auxiliary parameter. The programs also
access the array of observations, y, of size K. Each observation in y is assumed
to be sampled from one of these two sub-distributions: a normal distribution (as
normal_lpdf) or a uniform distribution (as the constant 0.5). For the program B,
consider a novice developer who, confused about Stan’s target statement [51],
calculated the negative likelihood instead.


¹ SixthSense is publicly available at https://github.com/uiuc-arc/sixthsense.


Fig. 1: An example of two models with different convergence behaviors. We obtain the
features from the Abstract Syntax Tree (AST) of source code and data (not shown here).
We use them as inputs to the trained Random Forest Model for predicting the label
(Converging/Not Converging). We can also obtain the most important features which likely
contributed to (non)-convergence.


When run with Stan’s default NUTS inference algorithm for 1000 iterations, the 
program A converges and the program B does not converge. Our goal is to predict, 
before running the programs, whether they will converge. If they do not converge, 
we would also want to know why and use this information to debug the program. 


Feature Extraction. First, we extract different classes of features for each program 
in the corpus of mutants. These include motifs — fragments of the AST, augmented 
with data features, and run-time features. To extract motifs, we parse each program 
and construct an AST. Then, starting from each node, we obtain all AST paths 
of length L by traversing the ancestors of the node. Figures 1(c) and 1(d) present
one sub-tree for the function call statement (in the loop) in the programs A and B,
respectively, and several motifs that SixthSense extracts. Each motif is encoded as
the sequence of its node type IDs, which serves as a feature vector.


A good learning algorithm should be able to combine similar motifs and operate 
only on groups of them. To identify such groups of motifs, we apply random discrete 
projections, a well-known technique for reducing the dimensionality of the feature 
space. It maps the feature vectors of the IDs onto a hash value with a much smaller 
dimension. The random projections algorithm has a distance-preserving property, 
which means that the similar vectors (even when they are not grouped together) 
will have similar low-dimensional representations. This property allows us to apply 
standard learning algorithms on this low-dimensional representation while preserving 
the similarity of the original motifs. 


Computing Reference Solutions and Labels. To compute the program labels
(i.e., ‘converging’, ‘not-converging’), SixthSense runs them for the default 1000
iterations using Stan’s MCMC algorithm (NUTS). For convergence, we calculate a
well-known diagnostic called the Gelman-Rubin ($\hat{R}$) statistic [25]. If the $\hat{R}$ statistic is
within a certain bound (close to 1.0), it indicates that the program converged.
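
To make the labeling step concrete, the sketch below computes the classic (non-split) Gelman-Rubin statistic for a single parameter from several equal-length chains, then thresholds it. This is an illustration under stated assumptions: the function names `gelman_rubin` and `is_converging` and the toy chains are hypothetical, and Stan itself reports refined variants of this diagnostic.

```python
import math
from statistics import mean, variance


def gelman_rubin(chains):
    """Classic R-hat for one parameter; `chains` is a list of equal-length sample lists."""
    n = len(chains[0])
    chain_means = [mean(chain) for chain in chains]
    within = mean([variance(chain) for chain in chains])  # W: mean within-chain variance
    between = n * variance(chain_means)                   # B: between-chain variance
    pooled = (n - 1) / n * within + between / n           # pooled variance estimate
    return math.sqrt(pooled / within)


def is_converging(chains, threshold=1.05):
    """Label a run as converging when R-hat is below the chosen threshold."""
    return gelman_rubin(chains) < threshold


# Toy chains (made up): the first pair overlaps, the second pair is stuck far apart.
mixed = [[1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0]]
stuck = [[0.0, 1.0, 0.0, 1.0], [10.0, 11.0, 10.0, 11.0]]
print(gelman_rubin(mixed), is_converging(mixed))
print(gelman_rubin(stuck), is_converging(stuck))
```

Chains that explore the same region give a value close to 1.0, while chains stuck in different regions inflate the between-chain term and push the statistic well above the threshold.
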


Training. Given a sufficient number of training programs, SixthSense extracts the 
features and gets the labels for convergence. SixthSense then generates precise and 
interpretable predictors. We build separate models for predicting convergence for each 
model class, since models in three classes are significantly different in both semantics 
and the way they interact with inference algorithms. The model classes are easy to 
identify for users without expertise or through simple analytical tools. 


Prediction. We use the classifier trained using the batch of Mixture Models for
convergence. We use a threshold of 1.05 for the Gelman-Rubin diagnostic (a very tight bound).
SixthSense correctly predicts the True label for the program in Figure 1(a) and the False
label for the program in Figure 1(b). The total time required for computing the features and
doing the prediction for a single program is less than a second, compared to 53
seconds on average to run a program.


Interpretation and Debugging. Our combination of random projections — which 
groups very similar motifs together, even if they appear at different locations in the 
program — and the random forest classification — which can easily explain its decisions 
— proves effective in identifying the parts of the program that impede convergence. 
Namely, we can employ SixthSense’s random forest classifier to identify top features. 
When SixthSense predicts non-convergence, the user can debug the program according 
to the top features. 


Now consider the scenario where a novice Stan developer used negative log- 
likelihood in Stan’s target statement, and wrote program B (Figure 1(b)). SixthSense 
predicts that B does not converge, and gives the topmost feature as the path segment 
(motif) starting from the negative sign to the parameters in the log-likelihood calcu- 
lation (function log_mix). Figure 2 presents this motif. There were three such motifs 
in program B (one for each argument of the log_mix function), highly contributing 
to non-convergence prediction. In contrast, this motif is missing from program A 
(Figure 1(a)), and thus has negatively contributed in the converging prediction. This 
observation validates our earlier intuition about the cause of difference in the nature 
of two programs and is correctly inferred by our prediction model. 
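
The effect of the negative sign can be illustrated numerically. The sketch below is not the code of programs A and B (Figure 1 is not reproduced here). It assumes a two-component mixture whose per-observation contribution is log_mix(theta, normal_lpdf(y | mu, sigma), log 0.5), re-implements Stan's `log_mix` and `normal_lpdf` in plain Python, and uses made-up data and parameter values.

```python
import math


def normal_lpdf(y, mu, sigma):
    """Log density of Normal(mu, sigma) at y."""
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * ((y - mu) / sigma) ** 2


def log_mix(theta, lp1, lp2):
    """log(theta * exp(lp1) + (1 - theta) * exp(lp2)), computed stably."""
    a = math.log(theta) + lp1
    b = math.log1p(-theta) + lp2
    hi = max(a, b)
    return hi + math.log(math.exp(a - hi) + math.exp(b - hi))


def target(y, mu, sigma, theta, sign=1.0):
    """Accumulated log density; sign=-1.0 mimics the mistaken negation (assumed form)."""
    uniform_lp = math.log(0.5)  # assumption: the constant 0.5 is a density
    return sum(sign * log_mix(theta, normal_lpdf(obs, mu, sigma), uniform_lp) for obs in y)


y = [0.9, 1.1, 1.0, 0.8, 1.2]  # made-up observations
good = {"mu": 1.0, "sigma": 1.0, "theta": 0.5}   # parameters that fit the data
bad = {"mu": 50.0, "sigma": 1.0, "theta": 0.5}   # parameters that do not
print(target(y, **good), target(y, **bad))
print(target(y, sign=-1.0, **good), target(y, sign=-1.0, **bad))
```

With the correct sign, well-fitting parameters score higher than poorly fitting ones. With the negation the ordering flips, so the sampler is steered away from the region that explains the data.
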


It is intuitive for the user to fix a non-converging program by altering program code
that corresponds to the top features. For program B, after the topmost motif indicates
the location that contributes to non-convergence, removing the negative sign would allow
the program to converge. After applying the change, the user can use SixthSense to predict
again, or even iteratively search for a good fix. This iterative debugging would be much
faster than running through the full compilation and execution with Stan. At the same
time, SixthSense can provide more directed warning messages.

Fig. 2: Topmost motif in program B: log_mix(param_i, ...), <39,76,47,10,54>


3 Overview


Figure 3 shows the architecture of SixthSense. We next describe each of its components.

Fig. 3: SixthSense Training Workflow

Feature Computation. SixthSense’s features can be broadly divided into three major
groups: (1) automatically-selected AST (Abstract Syntax Tree) based features, called
motifs, which represent fragments of the AST; (2) data features; and (3) runtime
features of the inference algorithm. We present our feature selection and summarization
in Section 4.
Program Generation. The generator uses the input set of seed programs to generate 
a batch of mutants. We use two sets of transformations to mutate the program: 
(1) Expansive Mutations produce more complex models compared to the original 
ones (e.g., add a new parameter), and (2) Reducing Mutations simplify the models 
by simplifying arithmetic expressions, removing conditional statements, etc. Our 
adaptive mutator uses nearest neighbor algorithms to efficiently explore the feature 
space of the programs. We explain the mutations and the algorithms in Section 5. 
Program Runner. It runs each generated mutant and collects several statistics 
such as samples from MCMC iterations and runtimes. 

Metric Calculator. Typically, the MCMC algorithms provide samples for each 
parameter from the posterior distribution. The metric calculator computes the 
convergence for each parameter using the samples from the posterior. 

Model Trainer. Using the syntax, data and runtime features and metrics computed
by the previous components, the Model Trainer builds a machine learning model
for predicting the behavior of probabilistic models for the given inference algorithm.
Here, we use a Random Forest classifier.

We build models to predict, for given metric thresholds, (1) convergence of the
models using static features of model and data, (2) convergence of the models using
static features and run-time diagnostics from initial phases of sampling, and (3)
the iteration count at which the model will converge.
Deploying the Trained Model. Once the trainer produces the model, we can use 
it to predict the convergence of new programs. For a given program and its dataset, 
SixthSense runs the feature extractor, runs it through the predictor and outputs 
the convergence label. It also reports on the features that contributed most to the 
prediction, and relates them back to the source code. 


4 Learning Program Features 


We present the description of the programs and SixthSense’s approach for collecting
code, data, and runtime features.

Probabilistic Programs Syntax. A probabilistic program is an imperative program 
with additional constructs for sampling from distributions, conditioning the model on 
observed data values, and one or more queries for either the posterior distribution or 
expected value of a parameter. In this work, we use a subset of the syntax of Storm-IR [19]
for representing probabilistic programs, as shown in Figure 4.

x ∈ Vars
c ∈ Consts ∪ {−∞, ∞}
aop ∈ {+, −, ...}
bop ∈ {=, >, ...}
Dist ∈ {Normal, Uniform, ...}
ID ∈ String

Type    ::= Int | Float
Decl    ::= x : Type | x : [c+]
Expr    ::= c | x | Expr aop Expr | Expr bop Expr
Stmt    ::= x = Expr | Decl | observe(Dist(Expr+), x)
          | x ~ Dist(Expr+) | for x ∈ 1..n {Stmt*}
          | if (Expr) then Stmt* else Stmt*
Query   ::= posterior(x) | expectation(x)
Program ::= Stmt* Query*


Fig. 4: Syntax of Storm-IR [19]


Representing Program Paths. To understand the causes of non-convergence and 
for better debuggability, we select a representation that is easy to train and interpret. 
Existing approaches Code2Vec/Code2Seq [4, 5] aim to predict variable names through 
natural-language semantics, and they encode the path between any two terminal 
nodes in the Abstract Syntax Tree (AST). Instead, we encode the sequences of AST 
nodes with limited length to pinpoint the semantic issues. We formalize our notions: 

Definition 1. (Abstract Syntax Tree) Similar to [5], we define an AST for a program
P as a tuple $\langle N, T, X, s, \delta, \phi, \psi \rangle$. $N$ is a set of non-terminal nodes, $T$ is the set of
terminal nodes, $X$ is a set of values, $s \in N$ is the root node, $\delta : N \to (N \cup T)^*$ is a
function which maps each non-terminal node to a list of its children, $\phi : T \to X$ is a
function which maps each terminal node to some value, and $\psi : N \to \mathbb{N}$ maps each
non-terminal node to a unique natural number.

Definition 2. (AST Path) An AST path is a path between the nodes in the AST,
which starts from one non-terminal node and ends at another non-terminal node,
passing through the ancestors of each node at each step.

Definition 3. (Motifs) A motif encodes an AST path of length up to $L$ from a node passing
through its ancestors. For a given AST path $(N_1, N_2, \ldots, N_L)$, where
$N_i \in \delta(N_{i+1})$ for all $i \in 1..L-1$, we define the motif as the list $(I_1, I_2, \ldots, I_L)$, where
$I_m = \psi(N_m)$ for all $m \in 1..L$.


4.1 Extracting Features from Programs 


Motivation. Two major challenges in efficiently encoding the motifs in a feature 
vector include (1) the large numbers of different paths that a program may have, and 
(2) the variability of length between different paths. A general approach to solve both 
problems is to design a flexible scheme for dimensionality reduction, which encodes 
the rich structures, like our motifs as a smaller set of program properties. 

We rest our approach on two observations. First, despite a huge number of possible 
syntactic paths, similar motifs repeat often in a single program and across multiple 
programs. Therefore, we need to think only about the subsets of all possible paths 
that appear in the corpus of programs. Second, the variability between motifs is 
often local, and many similar (though not-identical) motifs may lead to the same 
program behaviors. Therefore, instead of encoding each motif in the feature vector 
independently, we can group similar motifs and encode only the group. 

To reduce the dimensionality of available paths and group together similar motifs,
we use Random Discretized Projections (RDP) [8], a hashing technique for reducing
the dimensionality of large feature vectors. It is well known in data mining, but has
not been used for big-code representation. RDP calculates hash values that are used to
group similar items into the same buckets with high probability based on a similarity
metric (e.g., cosine similarity). The hash value represents the motif-group in the
feature vector.

Extracting Features from Individual Programs. Lines 5-9 in Algorithm 1
describe the procedure to extract motifs from a program. We iterate over the
nodes in the AST and, for each node, extract a sequence of nodes by visiting the
parent nodes up to level L, using the function GetMotifAt (line 6), which we define
recursively as GetMotifAt(N, L) = N :: GetMotifAt(parent(N), L−1), with base cases
GetMotifAt(∅, L) = ∅ and GetMotifAt(N, 0) = ∅.
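
The recursive definition can be transcribed almost directly. The following sketch is illustrative only: the AST is represented by a hypothetical child-to-parent dictionary, and the node names and type IDs are invented for the example.

```python
def get_motif_at(node, parent, type_id, depth):
    """Type IDs of `node` followed by its ancestors, at most `depth` entries."""
    if node is None or depth == 0:  # base cases: no node left, or depth exhausted
        return []
    return [type_id[node]] + get_motif_at(parent.get(node), parent, type_id, depth - 1)


def pad_right(motif, length, filler=0):
    """Pad a short motif with an unused filler ID up to the fixed length."""
    return motif + [filler] * (length - len(motif))


# Hypothetical fragment of an AST: child -> parent, and node -> type ID.
parent = {"neg": "target_update", "target_update": "for_loop", "for_loop": "model_block"}
type_id = {"neg": 7, "target_update": 3, "for_loop": 5, "model_block": 1}

full = get_motif_at("neg", parent, type_id, 3)        # three ancestors are available
short = get_motif_at("for_loop", parent, type_id, 3)  # the root is reached early
print(full, pad_right(short, 3))
```
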

The function SimilarityHash (line 7) computes a hash key of each motif using
Random Discretized Projections (RDP) [8]. If the size of the motif is smaller than L
(e.g., because the node does not have a sufficient number of ancestors), PadRight pads
the motif to the maximum size with unused elements. We increase the count for the
hash each time a similar motif obtains the same hash value (line 8). RDP
has a flexible number of projections and size of bins. These parameters can
be tuned to make similarity more or less fine-grained. They also control, indirectly,
the size of the feature vector, the construction of which we describe next.
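
One common formulation of random discretized projections, shown here as an assumption rather than as the exact scheme of [8], projects the padded motif onto a few random Gaussian directions and floors each projection into bins of a fixed width. The names `make_rdp_hash`, `num_projections`, and `bin_width` are hypothetical.

```python
import math
import random


def make_rdp_hash(dim, num_projections, bin_width, seed=0):
    """Build a hash function mapping a `dim`-vector to a tuple of bin indices."""
    rng = random.Random(seed)
    # One random Gaussian direction per projection (fixed once, reused for all motifs).
    directions = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_projections)]

    def similarity_hash(vector):
        keys = []
        for direction in directions:
            projection = sum(d * x for d, x in zip(direction, vector))
            keys.append(math.floor(projection / bin_width))  # discretize into a bin
        return tuple(keys)

    return similarity_hash


similarity_hash = make_rdp_hash(dim=5, num_projections=3, bin_width=10.0, seed=42)
key = similarity_hash([39, 76, 47, 10, 54])
print(key)
```

Under this formulation, more projections or narrower bins separate motifs more finely and increase the number of distinct hash keys, which matches the tuning trade-off described above.
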
Calculating Feature Vectors. Given a batch of programs Batch and the motif length L,
we iterate over the batch to extract the motifs for each program (lines 5-9), as described
in the paragraph above. Then, to store all the motifs, we first use InitFVTable to create
a feature vector table F whose column length is equal to the number of programs and the
row length is equal to the number of unique motifs (features) across all programs in the
batch (line 10). Each row of F is the feature vector of the program prog, and each cell
stores the count of a motif m in prog (lines 11-13). index maps between the motif hash code
and the column index in F. Finally, we output all the feature vectors.

Algorithm 1 Compute Feature Vectors
Input: Batch of Programs Batch, Motif depth L
Output: Feature Vectors F
 1: procedure CALCULATEFEATURES
 2:   batchMotifs ← ∅
 3:   for prog ∈ Batch do
 4:     progMotifCount ← {0,...,0}
 5:     for node ∈ nodes(AST) do
 6:       m ← GetMotifAt(node, L)
 7:       h ← SimilarityHash(PadRight(m, L))
 8:       progMotifCount[h] ← progMotifCount[h] + 1
 9:     batchMotifs(prog) ← progMotifCount
10:   F ← InitFVTable(Batch, batchMotifs)
11:   for prog ∈ Batch do
12:     for m ∈ batchMotifs(prog) do
13:       F[prog][index(m)] ← batchMotifs(prog)[m]
14:   return F
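
The table construction described above can be sketched as follows, assuming the motif hash keys of each program have already been computed. The function name `calculate_features`, the sorted column order, and the string keys are illustrative choices, not details taken from the paper.

```python
from collections import Counter


def calculate_features(batch_hashes):
    """Build one count vector per program from its list of motif hash keys."""
    # Per-program counts of each motif hash.
    batch_motifs = {prog: Counter(hashes) for prog, hashes in batch_hashes.items()}
    # One column per unique hash across the whole batch (sorted for a stable order).
    columns = sorted({h for counts in batch_motifs.values() for h in counts})
    index = {h: i for i, h in enumerate(columns)}
    table = {prog: [0] * len(columns) for prog in batch_hashes}
    for prog, counts in batch_motifs.items():
        for h, count in counts.items():
            table[prog][index[h]] = count
    return table, columns


# Hypothetical hash keys for two programs.
table, columns = calculate_features({"A": ["h1", "h2", "h1"], "B": ["h2", "h3"]})
print(columns, table)
```
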


4.2 Data Features 


The nature of the data-set may determine the performance of the probabilistic model 
when run using an inference algorithm. For instance, in the absence of sufficient data,
the choice of prior distributions becomes very important. Similarly, a strong prior
with very small variance is unlikely to converge to the correct results in such a 
scenario [2]. SixthSense computes data metrics like sparsity (number of non-zero 
elements), auto-correlation (correlation between values of a time series), skewness 
(asymmetry of the distribution), maximum/minimum variances of the model’s prior 
distributions, and several others for observed and predictor data variables. 
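
A few of these data metrics can be sketched with the standard library alone. The definitions below are common textbook forms and are assumptions about the exact formulas: the non-zero count follows the description of sparsity given above, autocorrelation is taken at lag 1, and skewness is the population moment ratio.

```python
from statistics import mean


def count_nonzero(values):
    """Number of non-zero elements, following the description of sparsity above."""
    return sum(1 for v in values if v != 0)


def lag1_autocorrelation(values):
    """Correlation between consecutive values of a series (lag 1)."""
    m = mean(values)
    denom = sum((v - m) ** 2 for v in values)
    numer = sum((a - m) * (b - m) for a, b in zip(values, values[1:]))
    return numer / denom


def skewness(values):
    """Population skewness: third central moment over the 1.5 power of the second."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]    # made-up data
right_tail = [0.0, 0.0, 0.0, 1.0, 9.0]   # made-up data with a long right tail
print(count_nonzero(right_tail), lag1_autocorrelation(symmetric), skewness(right_tail))
```
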


4.3 Runtime Features 


For inference algorithms like MCMC, diagnostics from the early stages (warmup) of 
sampling can often indicate the presence or absence of problems with the model and
associated data. Such diagnostics can help in discovering problems earlier so that
the users can update their model for more efficient performance. Unfortunately, they 
are not predictive in nature: manually observing the raw values may not provide a 
good intuition about the program execution. However, our prediction engine can infer 
useful information from them. 

To validate this intuition, we collect several runtime features from MCMC chains 
during the early stages of warmup iterations. These features are algorithm specific. For 
NUTS, they include posterior log density (log probability that the data is produced 
by the model using current set of the parameters), tree depth, divergence of the 
simulated trajectory, acceptance rate of the generated sample, step-size (the distance 
between consecutive samples), leapfrog steps, and energy estimate of Hamiltonian. 


5 Program Generation for Training Set Augmentation 


In this section, we describe our approach to generating mutant programs from a
corpus of seed programs. To produce mutants from the original seed programs, we
define two kinds of transformations: one for code and one for data.


5.1 Code Mutations 


Our Code Mutations can be broadly classified into two sets: (1) expansive muta- 
tions, which make more complicated models from the original one, and (2) reducing 
mutations, which reduce the complexity of the models. 

Expansive Mutations. These include Auxiliary Parameter Creator, which converts
a distribution argument to a parameter in the program; Conjugate Replacer, which
replaces prior distributions with distributions conjugate [46] to the likelihood when
possible; Dimension Expander, which expands the dimension of a scalar parameter
to match the data dimension; Constant Replacer, which lifts a constant in the program to
a parameter with an appropriate distribution; and Data to Parameter Transformer, which
randomly replaces a real-valued data array with a parameter of the same dimension.

Reducing Mutations. These include Arithmetic Simplifier, which
replaces arithmetic expressions with either of the operands or changes the arithmetic
operation; Conditional Eliminator, which replaces conditional statements with either of
the branches; Distribution Simplifier, which replaces complex distributions like Laplace
or Weibull with common distributions like Normal or Uniform; and Math-Function Call
Eliminator, which replaces common math functions like log, exp, etc. with constants.
These transformations have previously been used by [19] for testing PP systems.
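
To make the flavor of these transformations concrete, here is a minimal sketch of three of them on a toy expression tree. The tuple-based mini-AST and all function names are hypothetical and chosen for this example; the real mutations operate on Stan programs, and this Constant Replacer sketch lifts every constant and omits the prior distribution that the actual mutation would attach to the new parameter.

```python
import random

# Hypothetical mini-AST (not Stan's):
#   ("num", 1.0) | ("var", "w0") | ("call", "exp", expr) | ("bin", "+", lhs, rhs)


def arithmetic_simplifier(expr, rng):
    """Reducing mutation: replace a binary expression with one of its operands."""
    if expr[0] == "bin":
        return rng.choice([expr[2], expr[3]])
    return expr


def math_call_eliminator(expr, constant=1.0):
    """Reducing mutation: replace every math-function call with a constant."""
    kind = expr[0]
    if kind == "call":
        return ("num", constant)
    if kind == "bin":
        return ("bin", expr[1],
                math_call_eliminator(expr[2], constant),
                math_call_eliminator(expr[3], constant))
    return expr  # numbers and variables are left unchanged


def constant_replacer(expr, param_name):
    """Expansive mutation: lift each numeric constant to a named parameter."""
    kind = expr[0]
    if kind == "num":
        return ("var", param_name)
    if kind == "call":
        return ("call", expr[1], constant_replacer(expr[2], param_name))
    if kind == "bin":
        return ("bin", expr[1],
                constant_replacer(expr[2], param_name),
                constant_replacer(expr[3], param_name))
    return expr


# exp(w0) + 2.0
example = ("bin", "+", ("call", "exp", ("var", "w0")), ("num", 2.0))
```

Representing mutations as pure functions from tree to tree keeps them composable, so several can be chained on one seed program.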


5.2 Data Mutations 


Apart from source code transformations, we also added several data transformations. 
Such transformations help in changing the distribution of values in the data set, which 
could produce challenging scenarios for the probabilistic model or inference algorithm 
to work with. The data mutations include scaling by a constant, adding arbitrary noise, 
Box-Cox transformation [49], scaling to new mean and standard deviation, cube root 
transform, and random replacement of values with values from the same data set. 
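
The data mutations listed above can be sketched as small list-to-list functions. This is an illustrative sketch only: the function names, the Gaussian noise model, and the Box-Cox parameter `lam` are assumptions for this example rather than the settings used in the paper.

```python
import math
import random
import statistics


def scale(data, c):
    """Scale every value by a constant."""
    return [c * x for x in data]


def add_noise(data, sigma, rng):
    """Add zero-mean Gaussian noise (one possible choice of noise)."""
    return [x + rng.gauss(0.0, sigma) for x in data]


def box_cox(data, lam):
    """Box-Cox transform; defined only for strictly positive values."""
    if any(x <= 0 for x in data):
        raise ValueError("Box-Cox requires strictly positive data")
    if lam == 0:
        return [math.log(x) for x in data]
    return [(x ** lam - 1.0) / lam for x in data]


def rescale(data, new_mean, new_std):
    """Shift and scale the data to a new mean and standard deviation."""
    mean = statistics.fmean(data)
    std = statistics.pstdev(data)
    if std == 0:
        return [new_mean] * len(data)
    return [new_mean + new_std * (x - mean) / std for x in data]


def cube_root(data):
    """Sign-preserving cube root, so negative values stay real."""
    return [math.copysign(abs(x) ** (1.0 / 3.0), x) for x in data]


def resample(data, fraction, rng):
    """Replace roughly `fraction` of the values with values from the same data."""
    return [rng.choice(data) if rng.random() < fraction else x for x in data]
```

Passing an explicit `random.Random` instance makes each mutated data set reproducible from a seed.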


5.3 Adaptive Algorithm for Mutant Generation


To generate programs with different runtime behaviors, it is important to explore 
programs with diverse semantic and syntactic features. Our mutation algorithm 
randomly applies several mutations to the original program. However, to diversify 
the generated mutants it uses a nearest neighbor based algorithm (Locality Sensitive 
Hashing [12]), which only selects a representative set of mutants in multiple rounds. 


Algorithm 2 Selecting Mutants
Input: Seed Programs S, Programs M, BatchSize B
Output: Program Set progs
procedure SELECTMUTANTS
    rdp ← InitializeLSH()
    progs ← ∅
    while |progs| < M do
        for s ∈ S do
            seed ← chooseSeed(s, progs)
            p ← GeneratePrograms(seed, B)
            for k ∈ p do
                fv ← feature_vector(k)
                if rdp.neighbours(fv) < 1 then
                    progs ← progs.append(k)
    return progs

Algorithm 3 Generating Mutants
Input: Seed program S, Programs M, Max Changes C
Output: Program Set progs
procedure GENERATEPROGRAMS
    progs ← ∅
    i ← 0
    while i < M do
        p′ ← p
        for t ∈ {1..C} do
            m ← chooseMutation()
            p′ ← m.mutate(p′)
        if p′ ≠ p then
            progs ← progs.append(p′)
        i ← i + 1
    return progs


Algorithm 2 presents the mutant selection algorithm. The inputs to the algorithm
are the seed programs S, the total number of programs to generate M, and the number
of programs to generate in each batch B from each seed program. The algorithm
returns the selected set of mutant programs, progs, as output. First, we initialize the LSH
(Locality Sensitive Hashing) engine. We used four Random Discrete Projections hash
functions. Next, in each round, we first choose a seed program using the chooseSeed
function. The chooseSeed function randomly chooses among the original seed program
s and the mutants generated from it (in progs) in earlier rounds. Next, we generate a
new batch of programs of size B using generatePrograms.

For each newly generated program k, we compute its feature_vector and its number
of neighbors among the already generated programs. We select the program only if
it has no neighbors in the already selected set of programs. Finally, the algorithm
returns the selected set of programs once it has generated the target M programs.

The generatePrograms algorithm (Algorithm 3) generates M mutants for a seed
program S. For each program, in each iteration, it applies a set of randomly chosen
mutations and adds the result to the set of new programs. Finally, it returns the set of new
programs to the caller. Using this algorithm, we obtained a diverse set of probabilistic
programs with a good balance of converging/non-converging behavior.
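
The selection loop described above can be sketched in Python as follows. This is a simplified illustration under several assumptions: the `RandomProjectionLSH` class is a hand-written stand-in for the LSH engine (a single hash built from four random projections with a fixed bin width), programs are represented directly by numeric tuples, and `generate` and `features` are hypothetical callables supplied by the caller. The `max_rounds` guard is also an addition for this sketch.

```python
import math
import random


class RandomProjectionLSH:
    """Toy LSH index: bucket key = discretized random projections of a vector."""

    def __init__(self, dim, projections=4, bin_width=5.0, seed=0):
        rng = random.Random(seed)
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                       for _ in range(projections)]
        self.bin_width = bin_width
        self.buckets = {}

    def _key(self, vector):
        return tuple(
            math.floor(sum(a * b for a, b in zip(plane, vector)) / self.bin_width)
            for plane in self.planes
        )

    def neighbours(self, vector):
        """Number of stored vectors that fall into the same bucket."""
        return len(self.buckets.get(self._key(vector), []))

    def add(self, vector):
        self.buckets.setdefault(self._key(vector), []).append(vector)


def select_mutants(seeds, total, batch_size, generate, features, dim, rng,
                   max_rounds=100):
    """Keep only mutants that have no LSH neighbour among those already kept."""
    lsh = RandomProjectionLSH(dim)
    progs = []
    derived = {i: [] for i in range(len(seeds))}  # kept mutants per seed
    for _ in range(max_rounds):
        for i, s in enumerate(seeds):
            # chooseSeed: the original seed or one of its earlier mutants
            seed = rng.choice([s] + derived[i])
            for k in generate(seed, batch_size):
                fv = features(k)
                if lsh.neighbours(fv) < 1:
                    lsh.add(fv)
                    progs.append(k)
                    derived[i].append(k)
                    if len(progs) == total:
                        return progs
    return progs
```

Because a mutant is kept only when its bucket is empty, every selected program ends up in a distinct bucket, which is what enforces diversity in this sketch.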


6 Methodology 


We present the methodology for collecting seed probabilistic models and the program 
features and metrics we compute. 

Seed Probabilistic Models. We collected a corpus of probabilistic models from
the most comprehensive open-source repository of Stan models [52] (see footnote 2).
Out of a total of 505 models, we selected the three most common categories: Regression
(120 models), Time-Series (23), and Mixture Models (23, augmented with 3 from [33]).
The models come with their datasets.

Inference Engine and Sampling. We used NUTS, the default inference engine of Stan [24].
We executed all programs using 4 MCMC chains with 1000 iterations each for the warmup
and sampling phases. This is the default configuration for Stan. We also checked the
eventual convergence by running the programs for many more iterations. We used 
100,000 as the maximum number (the convergence metrics do not change significantly 
even for 10° iterations for the seed models). 

Feature Extraction. We used a Python based implementation of Randomized 
Discretized Projection [1]. We set its hyper-parameters P=5 and bin-width B=5, 
which worked well to reduce the dimensionality of the vector space. 

Random Forests. We used the Random Forest classifier from the scikit-learn package
in Python for training, with 5-fold cross-validation. We extract the top
features using TreeInterpreter [56].

Execution Setup. We performed the mutant generation and feature computation 
on an Intel Xeon 3.6 GHz machine with 6 cores and 32 GB RAM. We used Azure 
Batch Scheduling Service to run all the programs and metrics computations. We 
capped the MCMC execution at 240 minutes.


6.1 Baselines, Metrics, and Classification 


Baselines. We compare SixthSense to three baselines. The first, Code2Vec [5],
and the second, Code2Seq [4], are state-of-the-art predictors based on Deep Neural
Networks for big-code. They were originally used to predict function names from
code. We adapted these systems to do classification for each threshold of convergence
by extracting path contexts (subsets of paths similar to our motifs) from the code.
The third baseline, the majority classifier, assigns the most frequent label in
the training set to all the predicted programs. It indicates the prediction ‘hardness’
when the training set is imbalanced.

Metrics. We used a common metric for measuring convergence, the Gelman-Rubin
diagnostic (R) [25]. Ideally, the value of this metric should be close to 1.0. An
observed value of R such as 1.05 is considered a good indication of convergence.
Larger values, e.g., 1.5 and greater, are considered weaker evidence of convergence.
Given a threshold, we assign the label True to a program if the metric value is
within the threshold and False otherwise.
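
The labeling rule can be illustrated with a short sketch that computes the diagnostic for a single parameter and compares it to a threshold. The code below uses the classic (non-split) Gelman-Rubin formula, which is an assumption made for simplicity; Stan reports a split variant, so values may differ from what Stan prints. The function names are hypothetical.

```python
import math
import statistics


def gelman_rubin(chains):
    """Classic potential scale reduction factor for one parameter.

    chains: list of equally long lists of samples, one list per MCMC chain.
    """
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean([statistics.variance(c) for c in chains])
    between = n * statistics.variance(chain_means)
    if within == 0:
        # Degenerate case: all chains are constant
        return 1.0 if between == 0 else math.inf
    pooled = (n - 1) / n * within + between / n
    return math.sqrt(pooled / within)


def converged(chains, threshold=1.05):
    """Label True if the diagnostic is within the threshold, else False."""
    return gelman_rubin(chains) <= threshold
```

Chains that explore the same region give a value near 1, whereas chains stuck in different regions inflate the between-chain variance and push the value well above any of the thresholds used here.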


2 The number of publicly available probabilistic programs in public sources is low, compared 
to conventional languages. This is in part due to the novelty of these languages and 
expertise required to create and interpret those models. As a further challenge, Stan 
programs require the corresponding data set of sufficient size, which many Stan programs
on GitHub do not have. Finally, most publicly available programs are tuned to converge
on their available data sets.


6.2 Evaluation Experimental Setup


Training and Test Sets. We generate a corpus of mutant programs for each seed
program using the approach discussed in Section 5.3. We create a test-train split 
for every seed program in the following way: (1) Test set consists of a single seed 
program and all its mutants; (2) Training set contains all other seeds and mutants. 
Thus, the training is not aware of any mutants of the test seed program. For each 
such split, we train a classifier using the training set and evaluate its performance 
(using the metrics below) on the test set. With this strategy we obtain metrics for 
each split (each representing one seed program and its mutants). Finally, we compute 
the average performance across the splits. 

Training a predictor by leaving out each model and its mutants in test set allows 
us to stress-test the model predictor. We choose this evaluation strategy because 
the number of original seed programs in each class is low compared to conventional 
big-code data-sets. Every seed probabilistic program represents a different statistical 
model and using this strategy helps us evaluate the sensitivity of the classifiers for 
each such model. 
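
The split strategy described above amounts to a leave-one-seed-out evaluation. A minimal sketch, assuming each program is stored as a pair of its seed identifier and the program itself, and with `train_fn` and `score_fn` as hypothetical caller-supplied callables:

```python
def leave_one_seed_out(programs):
    """Yield (held_out_seed, train, test) for every seed.

    programs: list of (seed_id, program) pairs; a seed program carries its
    own identifier, and each mutant carries the identifier of its seed.
    """
    seed_ids = sorted({seed_id for seed_id, _ in programs})
    for held_out in seed_ids:
        test = [p for seed_id, p in programs if seed_id == held_out]
        train = [p for seed_id, p in programs if seed_id != held_out]
        yield held_out, train, test


def average_score(programs, train_fn, score_fn):
    """Train on each training set, score on the test set, average over splits."""
    scores = []
    for _, train, test in leave_one_seed_out(programs):
        model = train_fn(train)
        scores.append(score_fn(model, test))
    return sum(scores) / len(scores)
```

Grouping by seed identifier, rather than splitting programs at random, is what guarantees that no mutant of the test seed leaks into training.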

Classification Scores. We used Precision, Recall, Accuracy, and AUC [21] to evaluate
the performance of the learned classifier. They range between 0 and 1 (higher is better).
We use the same metrics for all the baselines.

Accuracy and AUC are adequate metrics for our scenario: since we perform training
by creating a test-train split for every seed program and its mutants (Section 6.2),
in some cases the test set can become imbalanced, e.g., with no or few positive labels, no
true and false positives, or extremely different sizes of the splits.
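
The sensitivity of precision and recall to imbalanced splits can be seen in a small sketch. The function below is an illustration written for this example; returning `None` for an undefined score is one possible convention, not necessarily the one used in the evaluation.

```python
def classification_scores(y_true, y_pred):
    """Accuracy, precision, and recall for Boolean labels.

    Precision or recall is None when its denominator is zero, which happens
    for splits without predicted or actual positive labels.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else None,
        "recall": tp / (tp + fn) if tp + fn else None,
    }
```

For a split that contains only negative labels, accuracy is still well defined while precision and recall are not, which is why averages of the latter over splits should be read with care.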


Fig. 5: SixthSense Prediction Scores for Convergence (Measured Using Gelman-Rubin
Diagnostic). Panels: (a) Regression models, (b) Mixture models, (c) TimeSeries models.
Y-axis: Accuracy Score. Bars: Majority, Code2Vec, Code2Seq, SixthSense, SixthSense+RT.


7 Evaluation 


7.1 Predicting Convergence of Inference 


Figure 5 presents the prediction scores for SixthSense when predicting convergence
of MCMC algorithms (NUTS in this case). The Y-axis shows the accuracy scores
for each prediction model (higher is better). The X-axis shows the four thresholds
(1.05-1.2) of the convergence metric, Gelman-Rubin diagnostic, that we considered in
our evaluation. We chose this range to test how general the prediction can be as the
individual program labels change. For each threshold, we plot the accuracy scores of
our prediction model (SixthSense) together with Code2Vec, Code2Seq and a Majority
Label Classifier, as vertical bars in different colors. We evaluated the trained model
on a held-out test set (see Section 6.2).


Table 1: Precision (P) and recall (R) (R=1.05)

Class       | 6s-AST      | Code2Vec    | Code2Seq
            | P     R     | P     R     | P     R
Regression  | 0.71  0.71  | 0.63  0.69  | 0.66  0.72
Mixture     | 0.77  0.74  | 0.67  0.67  | 0.67  0.72
Time Series | 0.79  0.75  | 0.69  0.74  | 0.74  0.77


Table 2: AUC scores (R=1.05)

Class       | 6s    | 6s+RT | Code2Vec
Regression  | 0.82  | 0.88  | 0.73
Mixture     | 0.84  | 0.90  | 0.74
Time Series | 0.86  | 0.89  | 0.79

Comparison with Code2Vec/Code2Seq. Figure 5 shows that SixthSense, with
solely AST motifs, is better than Code2Vec and Code2Seq (see also the ablation study
in Section 8). The results show that SixthSense’s learned classifiers have an accuracy
score close to 0.8. These prediction rates are already useful for users because they
help them avoid wasting time compiling and running programs that would
likely not converge. Our training algorithm is able to learn classifiers that generalize
well across different thresholds.

For Regression and Mixture models, SixthSense has consistently better accuracy
than the other approaches across all thresholds. For the tightest convergence bound,
R = 1.05, its accuracy is 5 percentage points higher than the alternatives for
Regression and 8 percentage points higher for Mixture. For TimeSeries models, the
accuracy score of SixthSense is 1 percentage point higher than that of Code2Seq.

Table 1 presents the precision and recall for R = 1.05. SixthSense exhibits 
consistently higher precision over Code2Vec (8 to 10 percentage points) and Code2Seq 
(5 to 10 percentage points). SixthSense also has higher recall than Code2Vec (1 to 
7 percentage points), while the recalls of SixthSense and Code2Seq are comparable 
(within 2 percentage points). Recall that the precision/recall are averaged over those 
for different splits and can be more sensitive to small and unbalanced splits. 

Table 2 shows the AUC scores for SixthSense, SixthSense with runtime features and 
Code2Vec. Code2Seq does not provide its probability of predictions, which prevents 
us from computing its AUC score. The results show that SixthSense improves in 
AUC score over Code2Vec for all classes. 

The prediction accuracy, precision, and recall trends from Tables 1 and 2 persist for
higher thresholds of R.

Comparison to Majority Label Classifier. Figure 5 shows the comparison of
SixthSense to a naive Majority Label Classifier, which has a classification accuracy of
0.5. This indicates a significant improvement of SixthSense over the uninformed
random choice.

Predicting with Warm-up Runtime Features. Figure 5 presents the impact of
SixthSense’s AST features augmented with runtime features (Section 4.3) sampled
from the first 200 iterations of the warmup stage (at this point Stan still does not
issue warnings for our programs). Recall that the results of these iterations are dropped
by the inference algorithm, as in this phase the mixing of the MCMC chains has just
begun. However, they can be useful in addition to code features: they help improve
the prediction by a further 6 percentage points for Regression and TimeSeries, and 8
percentage points for Mixture models (R=1.05). Table 2 also shows the improvement
in AUC of combining AST and runtime features over the AST-only version of SixthSense.
However, note that collecting runtime features still requires compiling the program
and starting its execution. While this time differs among systems and datasets, it
may be non-trivial, as is the case for Stan (e.g., around 30 seconds for compilation).
This time may be an important factor when deciding to use a runtime predictor for
different PP systems. We also present a feature ablation study in Section 8.


7.2 Debugging Non-Converging Programs 


When SixthSense’s learned model predicts that a model will not converge, two natural
follow-up questions are (1) which part of the program is the likely culprit for
non-convergence and (2) how many iterations would be sufficient for the model to
converge, if it converges at all.

Debugging Approach. We interpret the outcomes SixthSense predicts, and leverage 
the AST features and the random forests to help pinpoint which part of the program 
leads to non-convergence. 

To obtain the set of programs, we randomly selected 40 probabilistic programs
from our test sets, equally across the three model classes, which SixthSense correctly
identified as non-converging for 1000 iterations. For each program, we obtained the
most important features from the learned random forest. We selected the top-5 features
(motifs) and inspected the model to identify whether the program parts covered by the
motifs contain the culprit of non-convergence. The top-5 features typically cover only 5%
of all the motifs, which means SixthSense points to a relatively small scope to debug.

We make up to two manual updates to each model by making changes only to
the AST elements identified by the motifs or the referenced observed data. These
changes represent simple semantic modifications that a user of a probabilistic program
might make as they explore various possible models for their data. We simulate a
try-and-check interactive search with these localized transformations. For instance,
SixthSense identified a constant array in a regression equation as one of the top
motifs. Converting that constant into a parameter made the model converge. Some
of our attempted updates include changing the variance (constant) of a distribution,
changing the distribution for a parameter, changing a parameter to a constant, and
removing mathematical functions (e.g., abs, log) when they are redundant.

After transforming the model, we run inference to see if it converges. We further
check if the model becomes accurate (or correct) after the fix, since non-convergence
often causes inaccurate (or wrong) results. For each model, we apply accuracy tests
from Bayesian model checking [25, Ch.6]: we compute the mean squared error to
compare the new model’s result to its correct data and also visually inspect the
resulting density plot to check if it matches the correct distribution. Multiple student
authors inspected the updates and agreed that these changes followed the protocol
described above.

Results. Table 3 presents the results for this debugging application. Column 1
(Class) presents the classes of randomly sampled models. Column 2 (#Models)
presents the number of mutant models we randomly selected from each class. Column
3 (6s Upd.) presents the number of programs that we manually updated to converge
using the method above. Column 4 (Stan Warn.) presents the number of programs
for which Stan issued a warning during sampling. Column 5 (Stan Upd.) presents
the number of programs for which Stan’s warnings helped update the program to
converge.


Table 3: Debugging Non-Converging Models

Class      | #Models | 6s Upd. | Stan Warn. | Stan Upd.
Regression | 14      | 11      | 4          | 2
Mixture    | 13      | 9       | 4          | 1
TimeSeries | 13      | 9       | 4          | 2


Overall, we were able to identify the problem and make 29 updated models converge
out of 40 models. Specifically, we corrected 16 models by replacing
a parameter indicated by SixthSense with a constant, 6 by simplifying
mathematical functions, 3 by changing constants in distributions, 2 by converting
constants to parameters, and 2 by changing distributions for parameters. All the code
elements we changed were pointed to by the top three motifs SixthSense returned. For
the 11 models that we were not able to update, we believe that the model correction would
require more complex changes than those we specified in the setup above.

We ran SixthSense again on the 29 updated, now-converging models. It correctly
predicted that 21 will converge (8 from Regression, 8 from TimeSeries, and 5
from Mixture); this is, interestingly, close to the prediction rates from Section 7.1.
This illustrates that SixthSense can be useful in the iterative debugging loop.

These results demonstrate the advantage of the interpretability of SixthSense’s learned
model. Using motifs from the AST as features and a simple learning model (random
forests) helps the user easily identify key program components that affect the
runtime behavior of a probabilistic model. In comparison, identifying such important
features is hard for more complex neural network-based models and might require
more low-level handling of the learned model. In particular, Code2Vec and Code2Seq
do not provide a way to interpret how their prediction worked.

Comparison to Stan’s runtime warnings. Compared to Stan’s runtime warnings,
SixthSense motifs reveal more fine-grained patterns that hinder convergence. For most
of the non-converging models (29 out of the 40 in this experiment), Stan did not issue
a warning (beyond the low R value at the end of inference). The 12 warnings issued
by Stan only concern function domains. Seven out of the 12 were not related to
non-convergence. For instance, one program returns “Warning: normal_lpdf: Scale
parameter is -0.0799029, but must be >0.” Changing the scale parameter limits does
not help. Instead, SixthSense identifies a fix that is not at this location.

The remaining 5 Stan runs indicate non-convergence and can help with updating
the model. However, they were not as helpful in locating the causes as SixthSense.
One example where both SixthSense and Stan indicated a problem is the program
with the expression normal(exp(w0) + sqrt(abs(w1))*x1 + w2*x2, s). Stan warned
about the overflow in the first argument of normal, disregarding its sub-expressions.
SixthSense traced the problem to the sqrt and abs sub-expressions; simplifying these
function expressions indeed fixed the non-convergence.


8 Sensitivity Analysis 


We present various sensitivity analyses of SixthSense to justify our design choices. 

Table 4: Ablation Study (R=1.05). A: AST features, D: data features, RT: runtime features.

Class      | A    | A+D  | A+RT | A+D+RT
Regression | 0.77 | 0.77 | 0.83 | 0.83
Mixture    | 0.78 | 0.78 | 0.87 | 0.87
TimeSeries | 0.79 | 0.79 | 0.84 | 0.85


Table 5: Training w. Noisy Labels (R=1.05). R: Robust, B: Basic.

Label Flip Pr. | 1%           | 3%           | 5%
Model Class    | R      B     | R      B     | R      B
Regression     | 0.765  0.760 | 0.763  0.765 | 0.760  0.764
Mixture        | 0.772  0.784 | 0.774  0.782 | 0.783  0.785
TimeSeries     | 0.786  0.789 | 0.794  0.781 | 0.781  0.788


8.1 Feature Ablation Study 


Table 4 shows the Accuracy score for convergence predictions when trained with
different combinations of feature groups (AST features, AST and data features, AST and
runtime features, and all features). Runtime features are from 200 warmup iterations.
The AST features (motifs) alone contribute a major portion of the Accuracy scores in all
cases. Data features do not have much impact on these models. Runtime features, after a
certain number of iterations, further improve prediction (they are in fact a strong
predictor, but do not establish a relation with the program code). Obtaining runtime
statistics comes at the cost of compiling and running the program. This cost is often
over 30 seconds for Stan.

Impact of noisy labels on the prediction. To evaluate the robustness of our
prediction, we perturb the class labels in the training set with different noise levels.
We use a version of SixthSense that applies Rank Pruning [41]. Table 5 shows
the Accuracy scores for the different model classes for several noise levels (1-5%).
For each noise level, the Robust (R) column shows the scores when trained using the Rank
Pruning algorithm and the Basic (B) column shows the scores for baseline SixthSense. Even
in the presence of significant training noise, our learning approach maintains high
Accuracy scores. For instance, the performance for Mixture models remains almost
constant (close to 78%) even when 5% of the labels are wrong.
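
The label perturbation used in this experiment can be sketched as independent flips of Boolean training labels. The function names and the independent-flip noise model are assumptions for this illustration; the Rank Pruning step itself is not shown.

```python
import random


def flip_labels(labels, flip_probability, rng):
    """Flip each Boolean label independently with the given probability."""
    return [(not y) if rng.random() < flip_probability else y for y in labels]


def noise_rate(original, noisy):
    """Fraction of labels that differ between the two label lists."""
    changed = sum(1 for a, b in zip(original, noisy) if a != b)
    return changed / len(original)
```

For example, `flip_labels(train_labels, 0.05, random.Random(0))` would produce a training set in which roughly 5% of the labels are wrong, where `train_labels` is a hypothetical list of convergence labels.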

Other sensitivity studies. We also performed other sensitivity studies on the
features and generated programs. First, we looked at different motif sizes. For three
motif sizes (5, 10, 20) at the threshold R = 1.05, we do not see a significant increase in
the Accuracy score. This shows that even small motifs obtained from probabilistic
programs can be very effective for predicting their runtime behavior. Therefore, we
used a motif size of 5 in all our experiments.

We then removed overlapping motifs, which resulted in a reduction of the
Accuracy scores (by 2 to 5 percentage points). Other experiments, such as using different
LSH configurations to remove syntactically similar programs from the training set,
did not show substantial deviation from the reported scores.


9 Related Work 


Probabilistic Programming. Probabilistic programming languages (PPLs) and 
their underlying inference systems have recently gained significant interest from 
research and industry [9, 10, 26, 27, 29, 36, 38, 45, 55, 58]. Typically, PPLs (e.g., Stan)
only provide simple runtime diagnostics and timing information as they run. In contrast,
SixthSense is a predictive data-driven approach that complements these efforts.

The prior debugging approach for PPLs [39] requires augmenting Bayesian network 
representation with additional labels and requires extending the inference algorithm. 
However, its applicability is limited since state-of-the-art PP systems typically do not
use Bayesian network representation. Our approach learns program features useful for 
debugging without modifications to the inference algorithm. Existing tools [15, 19] 
find lower-level implementation bugs in probabilistic programming systems. 

Several recent approaches have explored the nature of regression tests in probabilistic
and machine learning applications, such as the causes and fixes for flaky tests [17, 18],
usage of seeds in tests [14], and speeding up expensive regression tests [16].

Predicting Program Properties from Big-Code. Much attention has recently
been devoted to uses of machine learning to analyze and predict various program
properties. Notable examples include predicting variable names/types via statistical
program models [47], predicting patches [35], summarizing code [3, 31], and API
discovery [5, 57]. However, all of these works apply learning to conventional programs
(C/Java/JavaScript), obtained from massive code repositories. Moreover, many of
these approaches predict static program properties (e.g., names/types), rather than
execution properties like convergence. While some of these approaches benefit from
the natural-language semantics of identifiers [4, 5], we are interested in the semantics of
the program itself, which are better represented by the sequence of AST nodes.

We also present how to augment the corpus of programs with diverse programs
via guided mutation. While our approach bears similarity to data augmentation in
machine learning [11, 50, 53], probabilistic programs have complex structure defined
by many syntactic (and often semantic) rules.

Predicting Algorithm Performance. Researchers have developed machine learning
approaches that predict the hardness of NP-hard problems (e.g., SAT, SMT, ILP) [7, 32, 34].
These works are complementary to ours, and the syntax and semantics of their inputs are
considerably simpler than those of probabilistic programs. Researchers have also proposed
models for the performance of other machine learning architectures [13, 20, 30, 43], but
their techniques and applications are orthogonal to ours.


10 Conclusion 


We presented SixthSense, a novel approach and system, which predicts convergence 
for probabilistic programs and helps guide the debugging of convergence issues. We 
show SixthSense is effective in extracting features from probabilistic programs and 
learning a prediction model. Compared to the state-of-the-art techniques, our results 
show significant improvement in accuracy. 


Abstract. Finding semantic bugs in code is difficult and requires precious
expert time. Lacking comprehensive formal specifications, deductive
verification is not an option. We propose an incremental specification
procedure: With the help of automatic verification tools, a domain expert
is guided through program runs and source code locations. The expert
validates a run at certain locations and creates lightweight annotations.
Formal methods training is not required. We demonstrate by example
that this approach is capable of quickly detecting different kinds of semantic
bugs. We position our approach in the middle ground between fully-fledged
deductive verification and bug finding without semantic guidance.


1 Introduction 


The main obstacle against using program verification tools for bug finding is 
not their efficiency, but a lack of meaningful formal specifications that capture 
the intended semantics of a given program [2,9]. This is unfortunate, because 
semantic bugs are dominant over memory-related bugs [15], but cannot be found 
by existing bug finding approaches [1,3,6,7,18], which look for syntactic patterns 
or generic errors (such as uncaught exceptions, memory faults, etc.). 

A notorious, relatively recent example was an alleged error in software
used in the UK to send mammography invitations to women in a certain age
group [11]. Not all letters were sent according to the specification, which would
statistically have led to belated diagnosis and possibly premature death of some
women. As it turned out, the specification was drawn up in hindsight, after
the software had been in use for years. Detecting the mismatch would have
required an expert to look at exactly the right decision points in the code and
to compare the implicit assumptions with the specification. This is a challenging
task: (i) There might be a vast number of inputs and runs: how to choose the
ones that give insight into a possible semantic bug? (ii) Keeping track of implicit
assumptions and checking their validity in a given run is tedious, time-consuming,
and error-prone.

In this paper we propose a novel approach to help experts find semantic
bugs: These are bugs where functional and expected behavior in a domain context
deviate, without domain-independent symptoms like abrupt termination or
blocking. We address the issues above by dedicated tool and language support.
The main ingredients are: (1) to render implicit assumptions in the code explicit,
traceable, and automatically checkable in the form of lightweight (Boolean Java)
specification expressions and a simple labeling mechanism; (2) to use (automatic)
deductive verification to guide the expert and to validate assumptions.

We do not aim at a fully automatic process, which we deem futile to detect 
non-trivial semantic bugs. We also are not interested in complete contract-based 
specification [13] as typically used in deductive verification [9], which we consider 
unrealistic in many cases, because of the required effort and the need for training 
in writing formal specifications. In contrast, the partial specifications we aim at 
are incrementally produced by a software engineer, guided by tool support. The 
annotations do not cover the full functionality of the analyzed software, but only 
part of the input space and source code. Therefore, the resulting annotations can 
stay simple and close to a designer’s understanding of the code. Specific training 
in formal methods is not required. 

The flip side of our approach is that we are unable to provide formal guar- 
antees about the absence of bugs. This is in common with other bug-finding 
technology, such as systematic debugging [18], bug finding tools [1], test case 
generation [7], or code inspection [6]. On the other hand, all of the mentioned 
techniques either look for a fixed set of syntactic conditions or assume the pres- 
ence of a specification, whereas we guide the user to come up with semantically 
relevant specification annotations. Consequently, we hope to occupy a sweet spot 
between fully-fledged deductive verification and bug finding without semantic 
guidance. In addition, partial specifications can help static verification as well 
as deductive verification tools. In this NIER paper we sketch our approach and
illustrate how it works with simple examples. A robust implementation and full
evaluation are envisaged.


2 Validating Program Runs 


To explain our approach we use the min method shown in Fig. 2. More realistic 
examples are provided in Section 4. We are given a software system and its 
source code. The system could be a method, a command line interface, or some 
other piece of software in which we want to find bugs. We assume the system is 
already free of memory-related or termination bugs (covered by existing tools)— 
in particular, for any given input, there is no runtime exception and the system 
terminates in some final state. We will have a (virtual) code position for all final 
states: This is where the validation routine will start, see below. 

Our validation process is performed by a domain expert, guided by a software 
assistant. A domain expert knows how the system should behave. In general, 
there is no (formal) specification, so we need an expert, who might be a software 
tester, code reviewer, or debugging specialist. We assume the expert understands 
source code and is able to validate the behavior of a given program run. 

Program runs are supplied in the beginning. These could have been collected
from log files while the system was running. We can often reconstruct a run if
all inputs (and events) are given. We assume the runs cover potential semantic
bugs; the more runs, the better. Although the expert could validate every single
run, it is unrealistic to look into all of them: there are far too many.


2.1 Syntax 


We illustrate our approach with JAVA source code, but it is applicable to any 
other (imperative) programming language with suitable tool support. 

The expert will stepwise annotate the software under validation with partial 
specifications. We contribute a simple annotation syntax. Annotations are placed 
after //@ or between /*@ and */, compatible with JML [12]. We do not use full 
JML, but only a fragment consisting of the labeled assertions and assumptions 
produced by the grammar in Fig. 1. The asserted/assumed expressions Expr 
are also simplified: A domain expert only needs to write side effect-free boolean 
JAVA expressions—quantifiers or other JML constructs are not required. 

Assumptions and assertions are labeled using a prefixed identifier ALabel 
inside < and >. Labeled assumptions/assertions are only effective when explicitly 
referred to—they are not assumed in general. To make such references, we extend 
assert statements with the keyword assuming. In program (4) of Fig. 2, the
assertion labeled bRes holds when assuming aGb.

The syntax allows an assertion to be assuming a logical combination of
(other) labeled assertions/assumptions. A conjunction of ALabels is written as
<l1,l2,...>. Any positive combination of ALabels in positive disjunctive normal
form (PDNF) is supported: One can build a complex (acyclic) graph of assumptions
and assertions depending on each other. We will see that the PDNF is
naturally obtained by the validation steps. The PDNF also makes checking/verifying
assertions easy, see Sect. 3.
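To make the PDNF reading concrete, the following is a minimal sketch, not part of the tooling described here. It assumes a hypothetical class LabelPdnf that stores each disjunct as a set of labels and evaluates the formula against the set of labels known to hold in a run.

```java
import java.util.List;
import java.util.Set;

// Hypothetical helper (an assumption for illustration): a PDNF is a list of
// disjuncts, and each disjunct is a set of labels read as a conjunction.
final class LabelPdnf {
    private final List<Set<String>> disjuncts;

    LabelPdnf(List<Set<String>> disjuncts) {
        this.disjuncts = disjuncts;
    }

    // True if at least one disjunct has all of its labels satisfied.
    // The empty disjunct, written <>, is satisfied trivially.
    boolean holds(Set<String> satisfiedLabels) {
        for (Set<String> conjunction : disjuncts) {
            if (satisfiedLabels.containsAll(conjunction)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Encodes: assuming <aRes> or <bRes>
        LabelPdnf pdnf = new LabelPdnf(List.of(Set.of("aRes"), Set.of("bRes")));
        System.out.println(pdnf.holds(Set.of("bRes")));
        System.out.println(pdnf.holds(Set.<String>of()));
    }
}
```

Because only conjunction and disjunction of positive labels occur, evaluation is a containment test per disjunct, which is one way to see why the restricted form keeps checking simple.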

Our annotations bear resemblance to [4,5], where the keyword verified is
used instead of assuming. In contrast to [4,5], labeled assumptions are not
expected to hold true in every run (in Fig. 2, assumption aGb is true for half of
the inputs). Hence, in our setting, there is no point in trying to verify
assumptions. Instead, to check or verify a claim, we can use a labeled assertion with
assuming <>, that is, an assertion without an assumption. For example, we
write assert x>0 assuming <> instead of assume x>0 to assume and verify
that variable x is greater than 0.

When the system under validation reaches a termination point, we are not
asserting any specific claim. Usually, however, a number of assertions must hold
before the (virtual) termination point. These are listed in an OnlyAssuming clause.
If the system boundaries are given by a method (as in Fig. 2), the OnlyAssuming
declaration is placed before the method: in the example, <aRes> or <bRes>.
This corresponds to a JML method specification clause [12].


Assumption   ::= [ <ALabel> ] assume Expr;
Assertion    ::= [ <ALabel> ] assert Expr [ assuming ALabelPDNF ];
OnlyAssuming ::= assuming ALabelPDNF;
ALabelPDNF   ::= < [ ALabel [, ALabel]* ] > [ or ALabelPDNF ]


Fig. 1. Syntax of labeled assumptions/assertions.


(1)
//@ assuming <aRes>;
int min(int a, int b) {
    int m = a;
    if (b < m) m = b;
    //@ <aRes>assert m==a;
    return m;
}

(2)
//@ assuming <aRes>;
int min(int a, int b) {
    //@ <aLb>assume a<=b;
    int m = a;
    if (b < m) m = b;
    //@ <aRes>assert m==a assuming <aLb>;
    return m;
}

(3)
//@ assuming <aRes> or <bRes>;
int min(int a, int b) {
    //@ <aLb>assume a<=b;
    int m = a;
    if (b < m) m = b;
    //@ <bRes>assert m==b;
    //@ <aRes>assert m==a assuming <aLb>;
    return m;
}

(4)
//@ assuming <aRes> or <bRes>;
int min(int a, int b) {
    //@ <aGb>assume a>=b;
    //@ <aLb>assume a<=b;
    int m = a;
    if (b < m) m = b;
    //@ <bRes>assert m==b assuming <aGb>;
    //@ <aRes>assert m==a assuming <aLb>;
    return m;
}

Fig. 2. Simple implementation of int min(int,int) with four validation steps. 


2.2 Validation Procedure 


A software validation assistant is intended to guide an expert in validating all
program runs without having to scrutinize each single run. In each validation
step, the expert validates a single program run against an assertion and provides
justifications for his or her judgment in the form of assumptions. The validation
assistant knows a current set of assertions G at certain source code locations
and a set of program runs R. In the beginning, G includes merely one implicit,
trivial assertion (assert true, always satisfied) at the (virtual) termination
point of the program. The set G grows after each validation step.


Example 1. We perform the validation procedure for int min(int,int), Fig. 2. 
In the initial setting (omitted from the figure) we just have the source code 
without any annotation. 

In the first validation step (1), the expert is given a program run with input
a==3, b==7, return value m==3 and the implicit assertion at the termination
point. The expert judges this run to be valid, places assuming <aRes> above
the method (virtually at the termination point), and then <aRes>assert m==a
as justification. Verification tools check the implicit (and trivial) assert true
at the virtual termination point under <aRes>. In (2), the expert looks at the
same program run, but now has to give assumptions for the assertion aRes.
The program run is still valid under the new assertion. The expert now adds
assuming <aLb>, and places the assumption <aLb>assume a<=b at the start of
the method. Tools check assertion aRes under aLb.

In (3), the expert is given a different program run a==9, b==0, m==0, plus the
trivial assertion at the virtual termination point. The expert adds or <bRes>
in the corresponding assuming, and <bRes>assert m==b before the method
returns. Again, tools verify. In (4), the expert looks at the same program run
as before. Since assertion bRes is now assuming aGb, assumption aGb is added.
Tools check and the validation procedure ends at this point successfully with a
partially specified program; no bug was found.


We consolidate into the general description of the validation procedure: 


Validation Assistant. Given sets of program runs R and assertions G, the
latter containing only the implicit assert true at the termination point. Repeat:
1. Choose¹ r ∈ R, g ∈ G such that:
   (a) Assertion g is reached and satisfied in r.
   (b) If g has an assuming clause, then none of the disjuncts in its PDNF is
       already satisfied in r.
   If there is no such r, g, the validation terminates.
2. Validation step (see below).
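The selection in step 1 can be sketched as follows. This is an illustrative sketch only: the class ChooseStepSketch, the Goal type and the encoding of runs as int arrays {a, b, m} are assumptions made for this example, and the preference for reusing the previous run or assertion is not modeled.

```java
import java.util.List;
import java.util.function.Predicate;

final class ChooseStepSketch {
    // Hypothetical model of a labeled assertion over runs of type R.
    static final class Goal<R> {
        final String label;
        final Predicate<R> reachedAndSatisfied;   // condition 1(a)
        final List<Predicate<R>> disjuncts;       // each tests one disjunct of the PDNF

        Goal(String label, Predicate<R> reachedAndSatisfied, List<Predicate<R>> disjuncts) {
            this.label = label;
            this.reachedAndSatisfied = reachedAndSatisfied;
            this.disjuncts = disjuncts;
        }
    }

    // Returns {runIndex, goalIndex} for the first pair satisfying 1(a) and 1(b),
    // or null if no such pair exists, in which case validation terminates.
    static <R> int[] choose(List<R> runs, List<Goal<R>> goals) {
        for (int gi = 0; gi < goals.size(); gi++) {
            Goal<R> g = goals.get(gi);
            for (int ri = 0; ri < runs.size(); ri++) {
                R r = runs.get(ri);
                if (!g.reachedAndSatisfied.test(r)) continue;
                boolean covered = false;
                for (Predicate<R> d : g.disjuncts) {
                    if (d.test(r)) { covered = true; break; }
                }
                if (!covered) return new int[] {ri, gi};
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Runs of min encoded as {a, b, m}.
        List<int[]> runs = List.of(new int[] {3, 7, 3}, new int[] {9, 0, 0});
        Goal<int[]> aRes = new Goal<int[]>("aRes", r -> r[2] == r[0],
                List.<Predicate<int[]>>of());
        int[] pick = choose(runs, List.of(aRes));
        System.out.println(pick == null ? "done" : "run " + pick[0] + ", goal " + aRes.label);
    }
}
```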


Validation Step. Given assertion g ∈ G and program run r ∈ R:
1. The expert judges r under g to be valid.
   Otherwise, a bug has been found and validation is aborted to fix it.
2. The expert adds a conjunction <AL₁,...,ALₙ> as a disjunct in the
   assuming of g.
   (a) In case of assuming <>, continue with 4.
3. For 1 ≤ k ≤ n such that assertion/assumption ALₖ does not exist yet, do
   one of the following:
   (a) Expert adds assumption labeled with ALₖ.
   (b) Expert adds assertion labeled with ALₖ (initially without assuming).
       The new assertion is added to G.
4. Verification tools check assertion g under <AL₁,...,ALₙ> as follows:
   (a) All assumptions/assertions AL₁,...,ALₙ are satisfied in r.
   (b) For all r′ ∈ R: if AL₁,...,ALₙ are satisfied in r′ then g is also satisfied.
   (c) Attempt to formally verify g assuming AL₁,...,ALₙ, see Sect. 3.
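Checks 4(a) and 4(b) amount to evaluating predicates over the recorded runs. The sketch below illustrates this under stated assumptions: the class RunCheckSketch is hypothetical, runs of min are encoded as int arrays {a, b, m}, and step 4(c), the formal proof attempt, is not covered.

```java
import java.util.List;
import java.util.function.Predicate;

final class RunCheckSketch {
    // 4(a): every assumption/assertion of the new disjunct holds in the given run.
    static <R> boolean holdsInRun(R run, List<Predicate<R>> conjunction) {
        for (Predicate<R> p : conjunction) {
            if (!p.test(run)) return false;
        }
        return true;
    }

    // 4(b): returns the first run that satisfies the conjunction but violates g,
    // or null if g holds in every run where the conjunction holds.
    static <R> R counterexample(List<R> runs, List<Predicate<R>> conjunction, Predicate<R> g) {
        for (R r : runs) {
            if (holdsInRun(r, conjunction) && !g.test(r)) return r;
        }
        return null;
    }

    public static void main(String[] args) {
        List<int[]> runs = List.of(new int[] {3, 7, 3}, new int[] {9, 0, 0});
        Predicate<int[]> aLb = r -> r[0] <= r[1];    // <aLb>assume a<=b
        Predicate<int[]> aRes = r -> r[2] == r[0];   // <aRes>assert m==a
        System.out.println(counterexample(runs, List.of(aLb), aRes) == null
                ? "aRes holds under aLb in all runs"
                : "aRes is violated under aLb in some run");
    }
}
```

With an empty conjunction (assuming <>), the same check requires g to hold in every run.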


3 Checking/Verification


Formal verification can be achieved by translating labeled assertions into ordi- 
nary JML assertions as described below. The latter can be handled with state- 
of-the-art verification tools: For example, we can combine static verification and, 
for each program run separately, run-time assertion checking. 

The translation processes one single assertion and its corresponding assumptions
at a time and generates a separate verification task for each. For example,
take assertion aRes from (4) in Fig. 2. There is just one corresponding assumption
aLb, so we delete all other assumptions in the source file. The resulting code is left
with only two annotations: //@ assume a<=b; and //@ assert m==a; without
labels and assuming.

¹ If possible, choose the same run or the same assertion as in a previous iteration.
This simplifies the validation step for the expert.

The translation of general ALabelPDNFs is more complex, for example, for
assertion pricePlausible in Fig. 4, line 19. We must show the assertion holds,
given either <dscdReg,minPr> or one of the other two disjuncts. We create
three verification tasks. In the first, dscdReg and minPr are JML assumptions
and pricePlausible becomes a JML assertion (similarly for the other two). We
obtain assume discount==0, assume movie.getPrice() > 5.60, as well as
the assertion assert dscdPrice >= 5.00. Observe that the labeled assertion
dscdReg is translated into an assumption.
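The per-disjunct translation can be sketched as a small generator. This is a simplified illustration, not the script from the supplementary material: the class JmlTaskSketch and its string-based representation are assumptions, and the sketch emits only the annotation lines of each task, ignoring where they are placed in the source code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class JmlTaskSketch {
    // Builds one verification task per disjunct of the PDNF. In each task, the
    // labels of the disjunct become plain JML assumptions and the asserted
    // expression becomes a plain JML assertion. The map exprs (label to boolean
    // expression) is a hypothetical representation chosen for this sketch.
    static List<List<String>> tasks(String assertedExpr,
                                    List<List<String>> pdnf,
                                    Map<String, String> exprs) {
        List<List<String>> result = new ArrayList<>();
        for (List<String> conjunction : pdnf) {
            List<String> task = new ArrayList<>();
            for (String label : conjunction) {
                task.add("//@ assume " + exprs.get(label) + ";");
            }
            task.add("//@ assert " + assertedExpr + ";");
            result.add(task);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> exprs = Map.of(
                "dscdReg", "getDiscount(age) == 0",
                "dscdJun", "getDiscount(age) == 10",
                "dscdSen", "getDiscount(age) == 15",
                "minPr", "movie.getPrice() > 5.60");
        List<List<String>> pdnf = List.of(
                List.of("dscdReg", "minPr"),
                List.of("dscdJun", "minPr"),
                List.of("dscdSen", "minPr"));
        for (List<String> task : tasks("dscdPrice >= 5.00", pdnf, exprs)) {
            System.out.println(String.join("\n", task) + "\n");
        }
    }
}
```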

After translation, we can perform checks with any tool that understands JML 
and JAVA. We plan to use deductive verification as well as run-time assertion 
checking tools for every single program run. Depending on the result from the 
tools, disjuncts in the assuming are highlighted in different colors, as in Figs. 3, 4: 

white Assertion unchecked 

red Assertion is violated in some run 

green Assertion is formally verified 
blue All runs are fine, but verification only partial due to system limitation 
yellow All runs are fine, but verification failed and gave a counter example 
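The color scheme above can be read as a small decision function. The sketch below is one possible encoding; the class DisjunctStatus, the Proof outcomes and the precedence of red over the proof result are assumptions of this sketch rather than details given in the text.

```java
final class DisjunctStatus {
    enum Color { WHITE, RED, GREEN, BLUE, YELLOW }

    // Hypothetical outcomes of the formal verification attempt.
    enum Proof { VERIFIED, PARTIAL, COUNTEREXAMPLE }

    // Maps the results of run checking and formal verification to a color.
    static Color classify(boolean checked, boolean violatedInSomeRun, Proof proof) {
        if (!checked) return Color.WHITE;          // assertion unchecked
        if (violatedInSomeRun) return Color.RED;   // violated in some run
        switch (proof) {
            case VERIFIED:
                return Color.GREEN;                // formally verified
            case COUNTEREXAMPLE:
                return Color.YELLOW;               // runs fine, proof failed with counterexample
            default:
                return Color.BLUE;                 // runs fine, proof only partial
        }
    }

    public static void main(String[] args) {
        System.out.println(classify(true, false, Proof.PARTIAL));
        System.out.println(classify(true, true, Proof.PARTIAL));
    }
}
```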

To demonstrate our approach, we wrote a script to translate annotations of 
all three examples in this paper [8]. We successfully reproduced the respective 
assertion verification. We expect that the performance of deductive verification 
tools is practical, as a side gain from the restricted syntax. 


4 Examples 


We demonstrate the validation procedure with two examples. Example 3 is less 
algorithmic and oriented towards real-world software, where an expert familiar 
with the application domain is essential for validating a software’s behavior. 
Example 2 features an implementation of int max(int[]), which produces in- 
correct results for certain inputs. We will find the bug in two validation steps. 


Example 2. Fig. 3 displays an implementation of int max(int[]). It produces
incorrect results for some inputs. However, we might not detect this immediately,
as it gives correct results in the majority of cases. Moreover, it does not throw an
exception, except when a.length==0. The supplementary material [8] contains a
list of 100 random input arrays we used in the experiments. Each array contains
between one and four random entries with values in [0, 100] (equally distributed).
From that list, 11 of 100 runs give an invalid result.

Initially, the code in Fig. 3 is not annotated. Then we start the validation
procedure. The set of initial goals consists of the return point of max(a) and we
will, as usual, start there. The assistant chooses a program run, for example,
the one corresponding to input a = {35,38,36,55}. Now the domain expert performs
the first validation step. The expert observes that result 55 is correct. The expert
slightly generalizes this: Whenever a.length == 4 and a[3] is greater than or equal
to each of the other three elements, then the result must be a[3]. Consequently,
the expert adds assertion max3res and assumption max3of4 as in
Fig. 3. Both names were chosen by the expert. Now the tool checks whether the
assertion holds whenever max3of4 holds. It turns out we have six runs (of 100
input arrays) where the assumption max3of4 holds. For one run, the assertion
is violated: for input {56,56,69,91}, the program outputs 69 instead of 91.


//@ assuming <max3res> or <max0res>;
int max(int a[]) {
    //@ <max3of4>assume a.length==4 && a[0]<=a[3] && a[1]<=a[3] && a[2]<=a[3];
    //@ <max0of1>assume a.length==1;
    //@ <max0of4>assume a.length==4 && a[1]<=a[0] && a[2]<=a[0] && a[3]<=a[0];
    int m = a[0];
    for (int k=0; k < a.length; k++) {
        if ( m < a[k]) {
            m = a[k++];
        }
    }
    //@ <max3res>assert m==a[3] assuming <max3of4>;
    //@ <max0res>assert m==a[0] assuming <max0of1> or <max0of4>;
    return m;
}


Fig. 3. int max(int[]) with conditioned assertion after some validation steps.
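The two runs discussed in Example 2 can be replayed with a few lines of Java. The method max below copies the implementation from Fig. 3, including its faulty k++; the class name MaxReplay and the helper predicates are introduced here for illustration only.

```java
final class MaxReplay {
    // Implementation as in Fig. 3, including the faulty k++ in the loop body.
    static int max(int[] a) {
        int m = a[0];
        for (int k = 0; k < a.length; k++) {
            if (m < a[k]) {
                m = a[k++];
            }
        }
        return m;
    }

    // Assumption max3of4: four elements and a[3] is a maximum.
    static boolean max3of4(int[] a) {
        return a.length == 4 && a[0] <= a[3] && a[1] <= a[3] && a[2] <= a[3];
    }

    // Assertion max3res: the result equals a[3].
    static boolean max3res(int[] a, int m) {
        return m == a[3];
    }

    public static void main(String[] args) {
        int[] valid = {35, 38, 36, 55};
        int[] invalid = {56, 56, 69, 91};
        System.out.println(max(valid));    // 55, max3res holds
        System.out.println(max(invalid));  // 69, max3res is violated although max3of4 holds
    }
}
```

In the failing run, the extra increment after m = a[2] skips index 3, so the element 91 is never compared.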


Since an invalid run was found, we are done. Observe that the domain expert
merely scrutinized the initial program run, where the result was still correct.
There are cases where more iterations are necessary. For example, the
validation assistant could have started with a singleton array {70} or with array
{81,73,26,15}. For either we would need two or more iterations as these program
runs do not have any similarity with one of the 11 invalid runs. See Fig. 3
for the annotated program with further assumptions max0of1 and max0of4.


Example 3. In a price calculation for cinema tickets, there are movies with dif- 
ferent age restrictions and ticket prices, and there are several age groups with 
different discounts. The example might get much more complex with discount 
criteria such as happy hours, theme days, or vouchers. We conjecture that our 
validation approach works in these cases, too. 


The relevant fragment of the ticket price calculation software is displayed in
Fig. 4. Our initial goal is to validate program runs of the method nextTicket,
starting from the termination points. The expert might first place an assertion in
the called method calcDscdPrice, and then place the corresponding age group
assumption in lines 5–7 of nextTicket.


There is a subtle bug which manifests in assertion pricePlausible (line 19)
under assumption senior. Let's say the expert placed this assertion because of
the cinema's policy to sell tickets for at least 5 €. Assertion minPr guarantees
that the normal price for each movie is more than 5.60 €. This holds for all
program runs but cannot be formally proven, because the implementation of
Movie is outside the boundary of the system under validation. Accordingly, the
corresponding assuming <> is highlighted blue. Going back to pricePlausible,
assume movie 2 has price 5.70 € in some program run. With the senior
discount this becomes 4.85 €, hence <dscdSen,minPr> is marked red.

//@ assuming <pricePlausible> or <tooYoung>;
public void nextTicket(Scanner input) {
    System.out.print("Enter age: ");
    int age = input.nextInt();
    //@ <junior>assume age < 16;
    //@ <regular>assume 16 <= age && age < 65;
    //@ <senior>assume 65 <= age;
    System.out.print("Select movie (1/2): ");
    int movieNumber = input.nextInt();
    //@ <mv1>assume movieNumber == 1;
    //@ <mv2>assume movieNumber == 2;
    Movie movie = movies[movieNumber];
    //@ <tooYoung>assert age < movie.getMinAge() assuming <junior,mv1>;
    if (age < movie.getMinAge()) {
        System.out.println("Too young for this movie.");
        return;
    }
    double dscdPrice = calcDscdPrice(movie, age);
    //@ <pricePlausible>assert dscdPrice >= 5.00
           assuming <dscdReg,minPr> or <dscdJun,minPr> or <dscdSen,minPr>;
    System.out.printf("Your price: %.2f €\n", dscdPrice);
}

private double calcDscdPrice(Movie movie, int age) {
    //@ <dscdReg>assert getDiscount(age) == 0 assuming <regular>;
    //@ <dscdJun>assert getDiscount(age) == 10 assuming <junior>;
    //@ <dscdSen>assert getDiscount(age) == 15 assuming <senior>;
    //@ <minPr>assert movie.getPrice() > 5.60 assuming <>;
    return movie.getPrice() * (1 - getDiscount(age)/100.0);
}


Fig. 4. Cinema Example. 
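The violation of pricePlausible under the senior discount can be recomputed directly. In the sketch below, the price formula follows calcDscdPrice in Fig. 4, while the body of getDiscount is an assumption derived from the assertions dscdJun, dscdReg and dscdSen, since its implementation is not shown; the class name TicketPriceCheck is likewise hypothetical.

```java
final class TicketPriceCheck {
    // Assumed implementation, consistent with dscdJun, dscdReg and dscdSen.
    static int getDiscount(int age) {
        if (age < 16) return 10;   // junior
        if (age < 65) return 0;    // regular
        return 15;                 // senior
    }

    // Same formula as calcDscdPrice in Fig. 4, with the price passed directly.
    static double calcDscdPrice(double price, int age) {
        return price * (1 - getDiscount(age) / 100.0);
    }

    // Assertion pricePlausible: tickets are sold for at least 5 EUR.
    static boolean pricePlausible(double dscdPrice) {
        return dscdPrice >= 5.00;
    }

    public static void main(String[] args) {
        double senior = calcDscdPrice(5.70, 70);
        // minPr holds (5.70 > 5.60), yet the discounted price is about 4.845.
        System.out.println(senior + " plausible: " + pricePlausible(senior));
    }
}
```

A normal price of 5.70 € satisfies minPr, but 15% off drops it below the 5 € policy, which is why the disjunct <dscdSen,minPr> cannot justify pricePlausible.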


5 Conclusion and Related Work 


We presented a procedure for a software engineer to validate program runs while
iteratively generating a partial specification. This helps to find semantic bugs
quickly. The annotations can be re-used, for example, in regression verification.

Our validation procedure incorporates the use of verification and assertion
checking tools. Assertion annotations have been in use since the 1970s [14];
verification has an even longer tradition. In contract-based verification [9],
specifications are structured along method declarations, whereas our approach allows
arbitrary dependencies via labeled assumptions, syntactically inspired by [4,5].

In [16], it is observed that in-house tests do not match the behavior of field 
program runs. Our approach directly validates the latter. Our validation is fin- 
ished if every program run is covered by assertions highlighted in green/blue. 
This suggests an alternative to their proposed solution—generating test cases 
mimicking field runs [17]. 

Even if an assertion could not be formally verified (blue/yellow/red) we 
check it against said program field runs. We believe that this will suffice in our 
setting, without excluding future enhancements. Notably, there is an approach 
to generate test cases for partially unverified assertions [5]. 

An attempt to improve code reviews by animated symbolic execution is reported
in [10]. In contrast, we guide an expert systematically through the code.


Abstract. Product Engineering Processes (PEPs) are used for describing complex
product developments in big enterprises such as those in the automotive and avionics
industries. The Business Process Model and Notation (BPMN) is a widely used language
to encode interactions among several participants in such PEPs. In this
paper, we present SMC4PEP as a tool to convert graphical representations of a
business process using the BPMN standard to an equivalent discrete-time stochastic
control process called Markov Decision Process (MDP). To this end, we first
follow the approach described in an earlier investigation to generate a semantically
equivalent business process which is more capable of handling the PEP
complexity. In particular, the interaction between different levels of abstraction
is realized by events rather than direct message flows. Afterwards, SMC4PEP
converts the generated process to an MDP model described in the syntax of the
probabilistic model checking tool PRISM. As such, SMC4PEP provides a framework
for automatic verification and validation of business processes, in particular
with respect to requirements from legal standards such as Automotive SPICE.
Moreover, our experimental results confirm a faster verification routine due to
smaller MDP models generated from the alternative event-based BPMN models.


Keywords: Product Engineering Processes · Verification and validation ·
Probabilistic model checking · Markov decision processes · Probabilistic reward CTL.


1 Introduction 


The ever-increasing technical challenges in products, for instance autonomous driving
in the automotive industry, require Original Equipment Manufacturers (OEMs) to
restructure their Product Engineering Process (PEP) from a mechanical-oriented to
a system-oriented development to enable a rigorous verification and validation of its
processes with respect to safety and non-safety requirements [5]. Additionally, legal
authorities oblige OEMs to address consistency and traceability in their PEPs through
compliance with standards such as Automotive Software Process Improvement and
Capability Determination (A-SPICE) [21]. As the quality of a product depends on the
quality of its processes [17], consistent, high-quality processes are required for adequately
addressing technical challenges, legal compliance and customer satisfaction.


© The Author(s) 2022

A well-known and widely used modelling language for processes in industrial
PEPs is Business Process Model and Notation (BPMN) [7], which we refer to as
pool-based BPMN (pBPMN). pBPMNs provide different users with their internal process
workflows in a graphical notation and show the communication and dependencies
between different organizations within the PEP. To address the above-mentioned
challenges, the previous work in [8] shows the need for a revision of the BPMN
language, which is called event-based BPMN (eBPMN) in this paper. The processes,
which are modelled according to the BPMN guidelines, are enriched with events and
time symbols, while the message flows of all processes are removed. In this way, we
capture time aspects such as milestones of PEPs, enable communication between
processes on different levels of abstraction by means of events, determine the
logical dependencies between processes, and finally remove process redundancies to
ensure consistency and traceability in PEPs. These arguments on process design
motivated us to consider eBPMN as the better design language in SMC4PEP. We
discuss later that eBPMN is more beneficial than its pBPMN counterpart because it
yields smaller MDPs and hence enables a faster verification routine. The core part of
SMC4PEP converts pBPMNs to eBPMNs while implicitly reducing the model size,
which is done by removing redundant processes without losing information.
As a by-product, it realizes consistency in PEPs by message passing on different
levels of abstraction, which is not the case if pBPMN is used as a design language. Then,
SMC4PEP converts the generated eBPMN to an equivalent MDP described in the syntax
of the probabilistic model checking tool PRISM [15]. SMC4PEP not only ensures
consistency in PEPs but also allows for automated verification of the generated MDPs
against formal descriptions of requirements from legal standards such as A-SPICE.


2 Related Tools 


There exist different tools for analyzing business processes. Due to the wide industrial 
use of the pBPMN standard, the most common tools for analyzing business processes 
use this graphical representation of processes as an initial model. 

The work of Ou-Yang and Lin in [19] provides an approach to translate pBPMNs
to the Modified BPEL4WS representation and then to Colored Petri Net XML
(CPNXML), which can finally be verified using CPN Tools. This approach has
restrictions in the support of split and merge conditions. The approaches of Daclin
et al. in [1] and Mendoza Morales in [18] realize a conversion of pBPMNs to a set of
Timed Automata (TA) and use Clocked Computation Tree Logic (CCTL) for the verification.
In the work of Lam in [16], pBPMNs are converted to the language of the New Symbolic
Model Verifier (NuSMV). NuSMV then enables an analysis of the processes using
model checking techniques and the verification of properties in Computation Tree Logic
(CTL). The approaches discussed in [1, 16, 18, 19] do not consider probability
distributions and non-deterministic choices of processes, which are required for complex
processes such as PEPs. Duran et al. [3] develop an approach based on rewriting logic to
enrich pBPMNs with timing and probabilistic properties. They verify stochastic
properties such as synchronization time and probability distributions by means of the
Parallel Statistical Model Checking And Quantitative Analysis (PVeStA) tool. However,
message passing between different processes, especially on different levels of abstraction,
is not considered. Finally, Herbert in [14] develops an algorithm for converting pBPMNs
into MDPs, where resources like timing and probabilities are considered while message
passing is performed between sub-processes. Nevertheless, the investigated
processes are small, and hence message passing between large processes, in
particular with different levels of abstraction, is not considered. Moreover, the process
model is designed with less message passing and complexity to avoid the well-known
state-space explosion in the generated MDP model, which consequently means that this
approach is not applicable to complex processes like PEPs.


3 SMC4PEP Architecture and Workflow 


As shown in Fig. 1, SMC4PEP consists of three modules, namely: (I) Differentiator,
(II) Converter and (III) Generator. The Differentiator determines whether the input model
is a pBPMN or an eBPMN. If it is a pBPMN, the Converter automatically converts the
process model to an eBPMN and then passes it to the Generator. Otherwise, with
an eBPMN as input, SMC4PEP skips the Converter and moves automatically to the
Generator. Finally, the Generator converts the eBPMN into an MDP described in the
PRISM syntax which can directly be analyzed in PRISM. The process of generating
the output PRISM model consists of three steps discussed as follows.


[Figure: SMC4PEP architecture. INPUT: a BPMN model exported from a modelling tool.
The Differentiator classifies the model as pool-based or event-based. The Converter
removes redundancy and pools and adds events. The Generator splits the model into
diagrams, generates MDP models and a PRISM module list, and takes an optional
timeline into account. OUTPUT: an MDP model in the PRISM language (DAT file).]


Fig. 1. Architecture of the tool SMC4PEP. 


Input. SMC4PEP requires a business process model as input, with no limitation on the
number of abstraction levels. Process models can be designed according to the guidelines
in either [7] or [8] with different modelling tools such as Enterprise Architect [4]. Each
process model needs to be exported as an XML document so that SMC4PEP can read it.


SMC4PEP. The Differentiator of SMC4PEP receives the input document and checks
the content of the BPMN model based on the syntactic and semantic differences
between eBPMN and pBPMN. According to [7], message passing between processes is
performed by message flows from task to task of the associated sub-processes, while
each sub-process obtains its own boundary called a pool. In the eBPMN approach,
message flows and pools are eliminated [8] and each sub-process obtains its own diagram.
Then the process is enriched with events to enable message passing between the
processes. If a pBPMN is detected, the Differentiator triggers the Converter; otherwise
the Converter is skipped and SMC4PEP automatically starts the Generator.

The Converter of SMC4PEP first analyzes the number of identical processes within the
whole process model in order to remove redundant processes of the pBPMN that may occur
on different levels of abstraction. A process is redundant when it is equal to a second
process in all elements of the model, that is, in the number and content of tasks, the
number and content of events, the number and content of gateways, the role/responsible
person of the process, as well as the number and order of sequence flows.
The definition of these elements is available in [7]. When equal processes are detected,
SMC4PEP eliminates all equal processes apart from one. Afterwards, all pools of the
process models are removed and each sub-process obtains its own diagram. Then,
message flows are eliminated and replaced with events to ensure message passing and
logical dependencies between the processes on different levels of abstraction. Note that
the message passing of the removed processes is also considered, so that only one process
enables communication between different levels of abstraction. Finally, the initial
pBPMN model has been converted into an eBPMN and the Converter triggers the Generator.
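The redundancy criterion can be illustrated with structural equality over the listed elements. The following is only a sketch under simplifying assumptions: the class ProcessModel and its string-based fields are hypothetical and do not reflect SMC4PEP's internal data model, and the rerouting of message passing from removed processes is omitted.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Objects;

final class RedundancySketch {
    // Hypothetical, simplified view of a process: two processes are redundant
    // if they agree in role, tasks, events, gateways and sequence flows.
    static final class ProcessModel {
        final String role;
        final List<String> tasks;
        final List<String> events;
        final List<String> gateways;
        final List<String> sequenceFlows;

        ProcessModel(String role, List<String> tasks, List<String> events,
                     List<String> gateways, List<String> sequenceFlows) {
            this.role = role;
            this.tasks = tasks;
            this.events = events;
            this.gateways = gateways;
            this.sequenceFlows = sequenceFlows;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof ProcessModel)) return false;
            ProcessModel p = (ProcessModel) o;
            return role.equals(p.role) && tasks.equals(p.tasks) && events.equals(p.events)
                    && gateways.equals(p.gateways) && sequenceFlows.equals(p.sequenceFlows);
        }

        @Override
        public int hashCode() {
            return Objects.hash(role, tasks, events, gateways, sequenceFlows);
        }
    }

    // Keeps the first occurrence of each process and drops all equal copies.
    static List<ProcessModel> removeRedundant(List<ProcessModel> processes) {
        return new ArrayList<>(new LinkedHashSet<>(processes));
    }

    public static void main(String[] args) {
        ProcessModel test = new ProcessModel("Tester", List.of("run test"),
                List.of("start", "end"), List.of(), List.of("start->run test", "run test->end"));
        ProcessModel copy = new ProcessModel("Tester", List.of("run test"),
                List.of("start", "end"), List.of(), List.of("start->run test", "run test->end"));
        System.out.println(removeRedundant(List.of(test, copy)).size());
    }
}
```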

The Generator requires an eBPMN, which is provided either by the Differentiator
or by the Converter. The process model is then split into its diagrams. Afterwards,
the Generator converts each diagram to an MDP, taking into account message passing on
different levels of abstraction by events, probability distributions and non-deterministic
choices. In the next step, the Generator of SMC4PEP generates a PRISM module list for
each MDP model; these lists are then combined into one main PRISM module list.
Finally, if a timeline [8] is available in the process model, the PRISM module list
is enriched with the values of the timeline to consider time aspects and process execution
costs as rewards in the MDP model described in the PRISM syntax.
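To give an impression of the target format, the sketch below emits a single PRISM module for a small MDP. It is not SMC4PEP's Generator: the class PrismEmitSketch and its types are hypothetical, and events, synchronisation labels and rewards are left out. A generator for several modules would also need a distinct state variable name per module.

```java
import java.util.List;

final class PrismEmitSketch {
    // One probabilistic branch of a command: with probability prob go to target.
    static final class Branch {
        final double prob;
        final int target;
        Branch(double prob, int target) { this.prob = prob; this.target = target; }
    }

    // One command enabled in state source. Several commands with the same source
    // model a non-deterministic choice in the MDP.
    static final class Command {
        final int source;
        final List<Branch> branches;
        Command(int source, List<Branch> branches) { this.source = source; this.branches = branches; }
    }

    // Emits an MDP with a single module whose local state variable is s.
    static String emit(String module, int maxState, List<Command> commands) {
        StringBuilder sb = new StringBuilder("mdp\n\nmodule " + module + "\n");
        sb.append("  s : [0..").append(maxState).append("] init 0;\n");
        for (Command c : commands) {
            sb.append("  [] s=").append(c.source).append(" -> ");
            for (int i = 0; i < c.branches.size(); i++) {
                Branch b = c.branches.get(i);
                if (i > 0) sb.append(" + ");
                sb.append(b.prob).append(":(s'=").append(b.target).append(")");
            }
            sb.append(";\n");
        }
        return sb.append("endmodule\n").toString();
    }

    public static void main(String[] args) {
        System.out.print(emit("review", 2, List.of(
                new Command(0, List.of(new Branch(0.9, 1), new Branch(0.1, 2))),
                new Command(0, List.of(new Branch(1.0, 2))))));
    }
}
```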


Output. SMC4PEP saves the generated MDP model described in the syntax of PRISM
as a DAT document which can be uploaded into the probabilistic model checker PRISM.
It is worthwhile to mention that there are quite a number of tools which are able
to read the PRISM modelling language. Among others, the model checkers Storm [2],
PARAM [10], ePMC [11] and Modest [12] can read our generated PRISM model to
model check various properties of interest.


4 Case Studies 


For the evaluation of SMC4PEP, we converted two different use cases with SMC4PEP.
Beforehand, we developed an algorithm inspired by the work of [14] to convert a pBPMN
directly into an MDP. Note that this conversion is not applicable to complex processes
with different levels of abstraction. Complexity here means a higher number of message
passings between processes, probability distributions and non-deterministic choices.
Therefore, for the evaluation we assumed that in pBPMN a communication between
different levels of abstraction is possible by merging all diagrams into one main diagram,
although this is not the case in real processes. This assumption is made to obtain the MDP
sizes of the pBPMN. In this way, the MDP sizes generated from a pBPMN and an eBPMN
model can be compared and the effectiveness of the eBPMN can be demonstrated. The first
use case describes the process of testing an autonomous park pilot with three levels of
abstraction and includes five roles where each role performs its associated task of the
process. The second use case handles a more complex process of an urgent request for
a change of the vehicle construction during the PEP. In total this use case extends over
four levels of abstraction and includes eleven roles. Both use cases are provided by an
automotive OEM. We ran all experiments on a Core i7 laptop running Windows 10.
Table 1 provides promising results generated by SMC4PEP. For the first use case
with two levels of abstraction, the MDP model generated from the eBPMN has 33.8%
fewer states and 40.07% fewer transitions than the one generated from the pBPMN.
Moreover, with three levels of abstraction, the MDP model generated from the eBPMN
has 67.78% fewer states and 73.11% fewer transitions than the one from the pBPMN. The
build time of the MDP model for the eBPMN with three levels of abstraction is higher than
for the pBPMN. Note that the MDP model is built only once, so the build time has no
impact on the run-time of model checking MDPs. This is the typical situation when a
formalism like an MDP is generated from giant BPMN models and then used several times
for model checking various properties. The generated
MDP models of use case two with four levels of abstraction are large compared to the
first use case due to the high number of activities, probability distributions and
non-deterministic choices of the processes. Nevertheless, the effectiveness of the eBPMN
for complex processes is strongly confirmed by the generated MDP size of the second
use case on four levels of abstraction, which is far less than the MDP size of the pBPMN.
Finally, our generated MDP models from eBPMN have much smaller sizes compared
to the approach discussed in [14]. In particular, for the second use case we obtained a
reduction in model size of several orders of magnitude, which is significant for an efficient
model checking routine. However, similar to [14], we also observe the state space explosion
problem, which can be alleviated using bisimulation minimization techniques [6, 9, 13].


Table 1. Results of the analyzed processes.

BPMN    Use    Abstraction   MDP model                Build
model   case   level         States     Transitions   time (s)
pBPMN   1      2             423        1143          0.071
eBPMN   1      2             280        685           0.037
pBPMN   1      3             5276       21503         0.170
eBPMN   1      3             1700       5782          0.551
pBPMN   2      4             93x10'     14x10"        4.263
eBPMN   2      4             17x10'°    19x10"!       0.871


Finally, we use the PRISM tool to model check some properties of interest
described in the Probabilistic Reward Computation Tree Logic (PRCTL) [20]. It is
worth noting that for SMC4PEP we provide the first use case as an eBPMN to
capture time and cost aspects of the PEP by a timeline, while the second use case is
described first in pBPMN and then converted to eBPMN. First, we verify some properties
based on the A-SPICE guidelines [21], denoted φ1, φ2 and φ3. The properties are taken
from the Generic Practices (GP) of A-SPICE Level 2 [21], where each level of A-SPICE
determines the quality of the processes. Property φ1 denotes GP 2.1.7 of A-SPICE,
which requires that the processes have no deadlocks and that the final state
of the process is reached with probability 100%. Additionally, φ2 denotes
GP 2.1.2, which ensures that performing the process fulfils the identified
objectives, similar to φ1. Moreover, φ3 denotes GP 2.1.3, through which we
ensure that our process does not deviate from its original setting according to A-SPICE.
Finally, for use case one, the non-functional properties are denoted by φ4, which delivers
the minimum number of days (d) for performing the whole process, and by φ5, which gives the
expected cost of the process in accumulated working days (wd).
Note that φ4 is obtained with the GUI simulator of PRISM. The results
of the property verification obtained from PRISM are shown in Table 2.


Table 2. Model checking of eBPMN processes.

Abstraction   Use    MDP model                Properties
level         case   States     Transitions   φ1   φ2   φ3   φ4 (d)   φ5 (wd)
2             1      280        685           ✓    ✓    ✓    R        267.9
3             1      1700       5782          ✓    ✓    ✓    110      346.5
4             2      17x10"     19x10!        ✓    ✓    ✓    -        -


5 Conclusion 


In this paper we presented the new tool SMC4PEP to enable in the first phase an auto- 
mated conversion of complex process models such as PEPs that are modelled according 
to the BPMN standard [7] into revised process models based on the modelling approach 
of [8]. This conversion paves the way for consistency and traceability of complex PEPs 
by removing redundant processes and enabling an exchange between different levels of 
process abstraction. In the second phase, SMC4PEP converts the new process model 
into an MDP to capture stochastic properties of a PEP and to enable an automated 
verification of the MDP using PRISM against formal descriptions of requirements. In 
case of designing a new PEP based on [8], SMC4PEP considers also the timeline of 
processes to capture time and cost aspects of a PEP that are essential for developing a 
new product, in particular in the automotive and avionics industries. Finally, we confirmed
the effectiveness of our tool in an automotive case study where we compared pBPMNs
with eBPMNs and verified some properties of interest such as legal regulations from
A-SPICE.


Abstract. We propose a trace-based symbolic method for analyzing
cache side channels of a program under a CPU-level optimization called 
out-of-order execution (OOE). The method is predictive in that it takes 
the in-order execution trace as input and then analyzes all possible out- 
of-order executions of the same set of instructions to check if any of them 
leaks sensitive information of the program. The method has two impor- 
tant properties. The first one is accurately analyzing cache behaviors of 
the program execution under OOE, which is largely overlooked by ex- 
isting methods for side-channel verification. The second one is efficiently 
analyzing the cache behaviors using an SMT solver based symbolic tech- 
nique, to avoid explicitly enumerating a large number of out-of-order 
executions. Our experimental evaluation on C programs that implement 
cryptographic algorithms shows that the symbolic method is effective in 
detecting OOE-related leaks and, at the same time, is significantly more 
scalable than explicit enumeration. 


Keywords: program analysis · out-of-order execution · side channel ·
SMT solver


1 Introduction 


There has been growing interest in recent years in detecting side-channel leaks in 
software using automated program analysis and verification techniques, due to 
the increased awareness of the threat of real-world side-channel attacks [4,15,18]. 
These are side-channel attacks because they exploit dependencies between sensi- 
tive information of the program and non-functional properties of the computing 
platform, including cache-related timing variations caused by CPU-level opti- 
mizations such as pipelining and branch prediction. While there are existing 
methods for detecting these side channels based on static analysis [6, 28,31] and 
symbolic execution [3, 10-12, 29], they do not accurately model an important 
CPU-level optimization called out-of-order execution (OOE). 

Out-of-order execution is widely adopted by modern CPUs. It is possible for
a program to be free of side-channel leaks when instructions are executed in
the program order but have leaks when they are executed out of order. Here,
the program order refers to the order in which instructions appear in the program.
However, modeling out-of-order execution during program analysis is a
challenging task due to the inherently large number of possible scenarios that
must be considered. Generally speaking, instructions within a fixed window (an
imaginary window used to model the effect of hardware features including the
reorder buffer, issue queue, and load-store queue) may be executed in any order
as long as the order respects the semantics of the program. Thus, given N instructions,
the number of possible execution orders can be as large as O(N!). Since it is
practically intractable to examine these execution orders individually, existing
methods have had to choose between two undesirable outcomes: if they over-approximate,
they may report bogus leaks since some infeasible execution orders
will be included; but if they under-approximate, they may miss real leaks since
some feasible execution orders will be excluded.

Fig. 1. SPRECA: symbolic predictive analysis for out-of-order execution.

To solve the aforementioned problem, we propose a trace-based symbolic pre- 
dictive analysis to accurately and efficiently analyze the OOE related cache 
behaviors. Here, accurately means that our method does not over- or under- 
approximate the OOE behaviors but precisely encodes these behaviors as a set 
of logical constraints; efficiently means that our method avoids enumerating the 
out-of-order executions explicitly to avoid the exponential blowup; instead it 
leverages an off-the-shelf SMT solver to conduct a symbolic analysis of the log- 
ical constraints. Our method is predictive in that, given an in-order execution 
trace of the program, it analyzes the cache behaviors of all out-of-order exe- 
cutions of the instructions that appeared in the in-order execution, instead of 
executing them. 

Fig. 1 shows the overall flow of our method, named SPRECA, which takes an 
annotated C program as input; the annotation marks program inputs as either 
public or private (secret). Internally, our method has three steps. In the first 
step, it utilizes the LLVM compiler to parse the C program, compute the program 
dependencies, and use the information to instrument the LLVM bit-code. The 
instrumented program, at run time, can generate the in-order execution trace. 
In the second step, our method encodes the set of all possible OOE related 
cache behaviors as a set of logical constraints, to be solved by an off-the-shelf 
SMT solver. In the third step, our method checks if there are secret-dependent 
divergent cache behaviors, e.g., an out-of-order execution causing a cache hit for 
one value of the secret variable but a cache miss for another value of the secret 
variable.

The main contribution of our work is symbolically modeling the OOE related
cache behaviors accurately and efficiently. We design the SMT encoding (to be 
presented in Section 5) carefully to make it compact. For example, a straightfor- 
ward encoding of all possible permutations of N instructions would lead to an 
SMT formula of size O(N), since any instruction may have any other instruc- 
tion as its predecessor and, as a result, the update function must be encoded 
for each predecessor’s cache state and the current cache state. Our method, in 
contrast, avoids most of these update functions by leveraging the program de- 
pendency relations recorded in the in-order execution trace to prune away the 
infeasible permutations. 

Our method differs significantly from the method of Guo et al. [10,11] based 
on symbolic execution. While their method also uses symbolic analysis, they 
only made the program input symbolic, whereas the out-of-order executions are 
still enumerated explicitly (this is evident based on their use of a technique 
designed for speeding up explicit enumeration, called partial order reduction). 
In other words, for each out-of-order execution, they had to generate an SMT 
formula to check if it has divergent cache behaviors; as a result, they did not 
avoid the exponential blowup. In contrast, our method generates a single SMT 
formula to encode all possible out-of-order executions associated with the in- 
order execution. In addition to being more efficient, our single-formula based 
encoding can be more easily adapted to model other CPU-level optimizations 
by slightly modifying how dependencies are encoded as logical constraints. 

We have implemented our method in a software tool by leveraging the open- 
source LLVM compiler [17] and the Z3 SMT solver [19]. Specifically, we use 
LLVM to parse the C program, compute the program dependencies, and instru- 
ment the bit-code, to generate the in-order execution trace at run time. We use 
Z3 to implement symbolic analysis of the out-of-order executions. We evaluated 
our method on a set of C programs from OpenSSL that implement well-known 
block ciphers and cryptographic hash functions. The experimental results show 
that our method, by accurately modeling the OOE related cache behaviors, can 
detect OOE-related side-channel leaks that otherwise would have been missed. 
The results also show that our SMT solver-based symbolic analysis is signifi- 
cantly more scalable than explicit enumeration. 

To summarize, this paper makes the following contributions: 


— We propose a trace-based symbolic predictive analysis for detecting OOE 
related cache side-channel leaks. 

— We rely on an off-the-shelf SMT solver to accurately and efficiently analyze 
the out-of-order executions associated with an in-order execution trace. 

— We demonstrate the effectiveness of our method on C programs from an 
open-source library that implements well-known cryptographic algorithms. 


The remainder of this paper is organized as follows. First, we motivate our 
work using examples in Section 2. Then, we provide the technical background 
in Section 3. Next, we present our method in Sections 4 and 5, followed by 
the experimental results in Section 6. We review the related work in Section 7. 
Finally, we give our conclusions in Section 8.


2 Motivation


In this section, we use examples to illustrate the cache behaviors of the in-order 
execution and an out-of-order execution. We also explain the high-level idea of 
our trace-based symbolic analysis. 


2.1 The Example Program 


Fig. 2 shows the code snippet which, for ease of presentation, is written in 
a mixture of C and simplified assembly language. Here, assume i ∈ {0,1,2}
is a secret variable and each array element A[i] occupies 4 bytes in memory.
Furthermore, while our method handles realistic cache size and configurations,
in this motivating example, we assume the cache has only one set, consisting of
3 cache lines, with each cache line holding only 4 bytes. We assume the cache
is fully associative, and uses the LRU (least recently used) replacement policy.
Under these assumptions, each array element A[i] occupies an entire cache line.


    load A[0];
    load A[1];
    load A[2];
    store A[i]; /* Can the secret value i affect the cache behavior? */
    load B;


Fig. 2. An example program where the value of i is a secret. 


2.2 The Execution Order 


The order in which instructions are written in a program is called the program 
order. During the in-order execution, instructions are executed according to their 
program order. Without loss of generality, we assume that there are two types 
of instructions: memory-related instructions such as Load and Store, and non- 
memory-related instructions, such as ALU and branch instructions. As far as 
this work is concerned, our focus is on memory-related instructions because 
non-memory instructions do not affect cache behavior¹.

Fig. 3 compares the in-order execution on the left with a possible out-of- 
order execution on the right. The out-of-order execution is a permutation of 
instructions of the in-order execution that, at the same time, must respect the 
semantics of the original program. In both of these two execution traces, each row 
represents an instruction and its associated memory address. Note that while a 
program may have if-else statements and thus multiple paths, an execution trace 
corresponds to only one program path. 


¹ Non-memory instructions may impose ordering constraints over memory-related
instructions. These constraints are computed by our method, and used to constrain
the analysis of out-of-order executions; details are in Section 4.


    In-order                                    ||  Out-of-order
    I1: load  0x77ef5bd0      /*A[0]*/          ||  I1: load  0x77ef5bd0      /*A[0]*/
    I2: load  0x77ef5bd4      /*A[1]*/          ||  I2: load  0x77ef5bd4      /*A[1]*/
    I3: load  0x77ef5bd8      /*A[2]*/          ||  I3: load  0x77ef5bd8      /*A[2]*/
    I4: store 0x77ef5bd0,...  /*A[i]*/          ||  I5: load  0x77ef5bdc      /*B   */
    I5: load  0x77ef5bdc      /*B   */          ||  I4: store 0x77ef5bd0,...  /*A[i]*/


Fig. 3. Two execution orders of the example program in Fig. 2. 


2.3 The Cache State 


Given a program execution, regardless of whether it is the in-order execution 
or one of the out-of-order executions, it is straightforward to compute changes 
of the cache state at each step. The cache state of our running example can be 
defined as a tuple S = (Age(A[0]), Age(A[1]), Age(A[2]), Age(B)), consisting 
of the ages of cache lines associated with the four program variables. Since we 
assume that the cache holds at most 3 variables (lines) at any moment, if a
variable is inside the cache, its age must be 0, 1, or 2; and if it is evicted from
the cache, its age must be 3. Initially, the cache state is $S_0 = (-1, -1, -1, -1)$,
where -1 is a special symbol meaning the variable has not been loaded into the cache yet.


In-Order Cache Behavior As shown in Fig. 4 for the in-order execution, executing
the first instruction load A[0] changes the cache state from $S_0$ to
$S_{I_1} = (0, -1, -1, -1)$, where $S_{I_1}$ is the cache state after executing $I_1$. That is, variable A[0]
now occupies the youngest cache line. Similarly, after executing the first three
instructions, the cache state becomes $S_{I_3} = (2, 1, 0, -1)$, meaning that A[2]
occupies the youngest cache line and A[0] occupies the oldest cache line. Thus,
executing the instruction store A[i] results in a cache hit regardless of whether
i = 0, 1, or 2. At this moment, the age of variable B remains -1 since it has not
yet been loaded into the cache.


    In-order (for i==1)                       ||  In-order (for i==0)
    S_I1 = ( 0,-1,-1,-1)  /*A[0] ColdMiss*/   ||  S_I1 = ( 0,-1,-1,-1)  /*A[0] ColdMiss*/
    S_I2 = ( 1, 0,-1,-1)  /*A[1] ColdMiss*/   ||  S_I2 = ( 1, 0,-1,-1)  /*A[1] ColdMiss*/
    S_I3 = ( 2, 1, 0,-1)  /*A[2] ColdMiss*/   ||  S_I3 = ( 2, 1, 0,-1)  /*A[2] ColdMiss*/
    S_I4 = ( 2, 0, 1,-1)  /*A[i] Hit     */   ||  S_I4 = ( 0, 2, 1,-1)  /*A[i] Hit     */
    S_I5 = ( 3, 1, 2, 0)  /*B    ColdMiss*/   ||  S_I5 = ( 1, 3, 2, 0)  /*B    ColdMiss*/


Fig. 4. Cache behavior of the in-order execution does not depend on the secret value
i; that is, for all i = 0, 1, 2, accessing A[i] results in a cache hit.


Out-of-Order Cache Behavior There can be many out-of-order executions, or 
permutations of instructions, corresponding to an in-order execution. While they 
must preserve the semantics of the in-order execution, they do not have to preserve
its cache behavior. Thus, even if the in-order execution does not have
divergent cache behaviors (with respect to a secret variable), one of the out-of-order
executions may have divergent cache behaviors. As shown in Fig. 5 for this
particular out-of-order execution that reorders store A[i] and load B, when
i ≠ 0, accessing A[i] results in a cache hit, but when i = 0, it results in a cache
miss.


    Out-of-order (for i==1)                    ||  Out-of-order (for i==0)
    S'_I1 = ( 0,-1,-1,-1)  /*A[0] ColdMiss*/   ||  S'_I1 = ( 0,-1,-1,-1)  /*A[0] ColdMiss*/
    S'_I2 = ( 1, 0,-1,-1)  /*A[1] ColdMiss*/   ||  S'_I2 = ( 1, 0,-1,-1)  /*A[1] ColdMiss*/
    S'_I3 = ( 2, 1, 0,-1)  /*A[2] ColdMiss*/   ||  S'_I3 = ( 2, 1, 0,-1)  /*A[2] ColdMiss*/
    S'_I5 = ( 3, 2, 1, 0)  /*B    ColdMiss*/   ||  S'_I5 = ( 3, 2, 1, 0)  /*B    ColdMiss*/
    S'_I4 = ( 3, 0, 2, 1)  /*A[i] Hit     */   ||  S'_I4 = ( 0, 3, 2, 1)  /*A[i] Miss    */


Fig. 5. Cache behavior of the out-of-order execution depends on the secret value i;
that is, accessing A[i] results in a cache hit when i ≠ 0 but a cache miss when i = 0.


2.4 The Side-channel Leak 


Whenever the cache behavior of an execution (regardless of whether it is the 
in-order execution or an out-of-order execution) depends on the value of a secret 
variable, it is called a side-channel leak. This is a security risk because, in modern 
CPUs, a cache hit only takes 1-3 CPU cycles whereas a cache miss may take 
up to a hundred CPU cycles. By observing the difference in the execution time 
of a victim program, the attacker may be able to deduce a certain amount of 
information about the secret. 

In our running example, since store A[i] is dependent on the value of the 
secret variable i, we need to check if executing store A[i] leads to divergent 
cache behaviors. During the in-order execution, the answer is no, since it results 
in a cache hit for all i = 0,1, and 2. Thus, the in-order execution has no side- 
channel leak. During one of the out-of-order executions, however, the answer is 
yes, since it results in a cache hit for some value of i but a cache miss for some 
other value of i. Thus, the out-of-order execution has a leak.

Generally speaking, there are two types of side-channel analysis techniques: 
approximate and accurate. While over- or under-approximation may be fast, 
it leads to poor results, i.e., reporting bogus leaks or missing real leaks. Thus, 
we are only concerned with accurate analysis techniques. In this context, while 
it is possible to examine each individual out-of-order execution, it will lead to 
exponential blowup. Our method, in contrast, encodes the cache behaviors of all 
out-of-order executions in a single logical formula. The formula is then solved 
using an efficient, off-the-shelf SMT solver to avoid an exponential blowup. 


3 Preliminaries 


In this section, we present the technical background related to our analysis of 
the out-of-order executions and divergent cache behaviors.


Fig. 6. The instruction window and the different execution orders.


3.1 The Execution Model 


Recall that modern CPUs may execute instructions of a program in any order as 
long as the end result remains the same. The default order is the program order, 
i.e., the order in which instructions appear in the program. For performance 
reasons, however, the CPU does not always follow the program order, because 
some instructions may be significantly slower than others and, instead of waiting 
for the slower instructions to complete, the CPU may choose to execute some 
subsequent instructions as long as the program semantics is preserved. 


Instruction Window As shown in Fig. 6, we use an imaginary instruction window 
to abstract the behavior of various hardware components inside the CPU for 
supporting out-of-order execution. The size of this instruction window depends 
on the CPU, including but not limited to the sizes of its reorder buffer, issue 
queue, and load-store queue. For this work, however, there is no need to delve into 
the hardware details. Instead, it suffices to assume that within this imaginary 
window of N instructions, the CPU may choose any execution order as long as 
the end result remains the same. 


Data Hazards To make sure that the end result remains the same, only the out- 
of-order executions that respect the data dependencies of the original program 
are allowed. In the computer architecture literature, violations of such depen- 
dencies are called hazards. Specifically, there are three types of hazards, named 
RAW (read after write), WAR (write after read), and WAW (write after write),
respectively. It is worth noting that RAR (read after read) is not a hazard.
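
The three hazard types can be illustrated with a small Python sketch. It is only an illustration and not part of our tool: the function name `hazard` and the representation of an access as a (kind, address) pair are assumptions made for this example, and two accesses are treated as conflicting only when their addresses are identical.

```python
def hazard(first, second):
    """Classify the hazard between two accesses given in program order.

    Each access is a (kind, address) pair with kind 'load' or 'store'.
    Returns 'RAW', 'WAR', 'WAW', or None when the pair may be reordered.
    """
    (kind1, addr1), (kind2, addr2) = first, second
    if addr1 != addr2:
        return None              # different memory blocks never conflict
    if kind1 == "store" and kind2 == "load":
        return "RAW"             # the later read needs the earlier write
    if kind1 == "load" and kind2 == "store":
        return "WAR"             # the later write must not overtake the read
    if kind1 == "store" and kind2 == "store":
        return "WAW"             # the final value must come from the later write
    return None                  # RAR (two loads) is not a hazard


# I1 (load) and I4 (store) of Fig. 3 use the same address.
print(hazard(("load", 0x77ef5bd0), ("store", 0x77ef5bd0)))  # WAR
```

In the running example this is the only conflicting pair, which is why load B may be moved ahead of store A[i].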


3.2 The Cache Model 


Without loss of generality, we assume the cache has K cache lines in total and 
each cache line has 64 bytes. The cache lines are further divided into M sets, 
which means each set has (K/M) cache lines. The memory is also divided into 
64-byte blocks, each of which is mapped to a unique set. Within the same set, 
however, the 64-byte block may occupy any of the cache lines. Thus, within the 
set, it is called fully associative; overall, the entire cache is called set associative. 
In this context, a fully associative cache is a special case (K-way set associative), 
while a direct mapped cache is another special case (1-way set associative).

The Cache State The cache state is a tuple $S_I = (Age(v_1), \ldots, Age(v_n))$, where
each $v_i \in Vars$ ($1 \le i \le n$) is a variable in the program, and $Age(v_i)$ is the
age of the cache line associated with $v_i$. Vars is the set of all variables. Here,
we use the subscript in $S_I$ to indicate that it is the cache state resulting from
executing the instruction $I$. Assume that K is the number of cache lines in a
set. The domain of $Age(v_i)$ is $\{0, 1, \ldots, K, -1\}$, where an age from 0 to K - 1
means the variable is inside the cache, while K means the variable is evicted
from the cache and -1 means it has never been loaded into the cache.
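
The cache organization described at the start of this section can be sketched in Python as follows. The text only requires that each 64-byte block maps to a unique set; placing block b in set b mod M is a common choice and an assumption of this sketch, as is the helper name `block_and_set`.

```python
LINE_SIZE = 64  # bytes per cache line, as assumed in the text


def block_and_set(addr, num_sets):
    """Map a byte address to (memory block number, cache set index).

    Using block % num_sets as the set index is an assumption of this
    sketch; the text only states that each block maps to a unique set.
    """
    block = addr // LINE_SIZE
    return block, block % num_sets


# With 64-byte lines, the four addresses of Fig. 3 share one memory block.
blocks = {block_and_set(a, 64)[0]
          for a in (0x77ef5bd0, 0x77ef5bd4, 0x77ef5bd8, 0x77ef5bdc)}
print(len(blocks))  # 1
```

This is why the motivating example in Section 2 assumes 4-byte cache lines: with realistic 64-byte lines, the four variables would occupy a single line.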

We assume that the cache uses the LRU (least recently used) replacement
policy. Given a cache state $S_I$ and an instruction $I'$, the new cache state $S_{I'}$
is computed by the function $Update(S_I, I')$. Assuming that $v \in Vars$ is the
variable used by the instruction $I'$, $u_1 \in Vars$ is another variable whose age was
younger than $v$ in $S_I$, and $u_2 \in Vars$ is yet another variable whose age was older
than $v$ in $S_I$, we compute the new cache state $S_{I'} = (Age'(v_1), \ldots, Age'(v_n))$ as
follows:

– $Age'(v) = 0$;

– $Age'(u_1) = Age(u_1) + 1$;

– $Age'(u_2) = Age(u_2)$.


That is, the most recently used variable ($v$) occupies the youngest cache line,
any variable ($u_1$) whose age was younger than $v$ in $S_I$ increases its age by 1, and
any variable ($u_2$) whose age was older than $v$ in $S_I$ keeps its age unchanged.
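
The update rule can be written as a short Python sketch. It is an illustration only, not the implementation in our tool. The names `update` and `run` are hypothetical, and the handling of a miss (age -1 or K) as "older than every cached line" is an assumption that the rule above leaves implicit.

```python
def update(state, v, k):
    """Return the cache state after accessing variable v in a k-line LRU set.

    state maps each variable to an age: 0..k-1 cached, k evicted, -1 never
    loaded. Treating a miss (age -1 or k) as older than every cached line
    is an assumption of this sketch.
    """
    old = state[v]
    threshold = old if 0 <= old < k else k
    new = {}
    for u, age in state.items():
        if u == v:
            new[u] = 0            # most recently used line
        elif 0 <= age < threshold:
            new[u] = age + 1      # was younger than v: ages by one
        else:
            new[u] = age          # older, evicted, or never loaded
    return new


def run(trace, variables, k):
    """Simulate a list of accessed variables and return all cache states."""
    state = {u: -1 for u in variables}
    states = []
    for v in trace:
        state = update(state, v, k)
        states.append(tuple(state[u] for u in variables))
    return states


VARS = ["A[0]", "A[1]", "A[2]", "B"]
in_order = run(["A[0]", "A[1]", "A[2]", "A[1]", "B"], VARS, 3)
print(in_order[-2:])  # [(2, 0, 1, -1), (3, 1, 2, 0)]
```

With K = 3 and i = 1, the printed states are the last two rows of the left column of Fig. 4. Swapping the last two accesses and setting i = 0 gives the final state (0, 3, 2, 1) of Fig. 5.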


3.3 The Side-channel Leak Condition 


Whenever there is a dependency between the secret and some divergent cache 
behaviors of an execution, there is a side-channel leak. Thus, there are two requirements.
First, there must be divergent cache behaviors, i.e., a memory-related
instruction causing a cache miss for some input value but a cache hit for some
other input value. Second, the input value causing divergent cache behaviors 
must be a secret, e.g., a password, security token, or cryptographic key. 

Thus, the side-channel leak condition can be defined as follows: 


\[ \exists E, I, v_1, v_2 \;.\; CacheStatus(E, I, v_1) \neq CacheStatus(E, I, v_2) \]


Here, E denotes an execution, and $I \in E$ is an instruction in E; $v_1$ and $v_2$ are two
values of a secret variable $v_s \in Vars$; and $CacheStatus(E, I, v_s)$ is a function
that returns the cache status (hit or miss) when instruction I is executed in E
using $v_s$.
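
For one fixed execution order and one fixed instruction, the leak condition can be checked by brute force over the secret values. The Python sketch below does this for the running example of Section 2. It is illustrative only: the names `cache_status` and `leaks` are hypothetical, and the single LRU set is modelled as a recency-ordered list rather than by explicit ages, which gives the same hit/miss outcome.

```python
def cache_status(execution, index, k):
    """Return 'hit' or 'miss' for the access at position index.

    execution is a list of accessed variables; the cache is one LRU set
    with k lines, kept as a list ordered from youngest to oldest.
    """
    lru = []
    status = None
    for var in execution[: index + 1]:
        status = "hit" if var in lru else "miss"
        if var in lru:
            lru.remove(var)
        lru.insert(0, var)
        del lru[k:]               # evict anything beyond k lines
    return status


def leaks(make_execution, index, secret_values, k):
    """True if two secret values give different cache statuses at index."""
    statuses = {cache_status(make_execution(s), index, k) for s in secret_values}
    return len(statuses) > 1


def in_order(i):
    return ["A[0]", "A[1]", "A[2]", f"A[{i}]", "B"]


def out_of_order(i):
    return ["A[0]", "A[1]", "A[2]", "B", f"A[{i}]"]


print(leaks(in_order, 3, [0, 1, 2], 3))      # False: store A[i] always hits
print(leaks(out_of_order, 4, [0, 1, 2], 3))  # True: miss only when i == 0
```

The existential quantification over E is what makes explicit checking expensive: this test would have to be repeated for every feasible execution order.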


4 Analyzing the In-Order Execution 


In this section, we present our method for generating, and then analyzing the 
in-order execution trace. There are two tasks. The first one is to compute the 
dependencies of memory-related instructions. The second one is to compute the 
default cache states. Both the dependencies and the default cache states will be 
used during our symbolic analysis of the out-of-order executions.


4.1 Computing the Dependencies


There are two types of dependencies associated with the in-order execution of a 
program: explicit dependencies and implicit dependencies. 


Explicit Dependencies Explicit dependencies refer to data conflicts that can be 
directly observed during the execution, by looking at the actual addresses of 
memory blocks used by the instructions at run time. Consider the in-order execution
example in Fig. 3 (left). Since both instructions $I_1$ and $I_4$ access the
memory block at the address 0x77ef5bd0, and at least one of them is a store
operation, these two instructions have an explicit dependency; that is, they cannot
be reordered during out-of-order execution.


    load  r1 A[0]    /*LD A[0]*/
    mul   r1 5
    add   r2 r1
    mov   r3 r2
    store A[1] r2    /*ST A[1]*/


Fig. 7. Example implicit dependency that cannot be observed in the execution trace. 


Implicit Dependencies Implicit dependencies, on the other hand, refer to data 
conflicts that cannot be directly observed during the in-order execution. Fig. 7 
shows an example. The code snippet shows that store A[1] is dependent on 
load A[0], through the def-use chain of (register) variables r1-r3. Since non- 
memory instructions (mul, add, mov in this example) do not show up in the 
logged execution trace, their constraints on the memory instructions would be
lost if we did not compute and record them explicitly in the execution
trace.

In our method, we compute the implicit dependencies by statically analyzing 
the LLVM bit-code of the program before instrumenting the bit-code to add 
self-logging capabilities. Then, we execute the instrumented code to obtain the 
trace. As a result, the implicit dependencies will be captured in the execution
trace as a special relation ($DEP_{sta}$). Static program analysis has a global view
of the program and thus is well suited for computing the implicit dependencies.
Inside LLVM, the bit-code is represented in Static Single Assignment (SSA)
form, meaning each variable is defined only once, which makes it possible to
efficiently compute the implicit dependencies [20].

In addition to the implicit dependencies ($DEP_{sta}$) computed by static analysis,
we also compute the explicit dependencies ($DEP_{dyn}$) based on the actual
addresses that appear in the execution trace: for each memory address, instructions
that use the address are checked to see if they have data hazards (RAW,
WAR, or WAW). For instructions that have data hazards, their relative execution
order during in-order execution cannot be violated; otherwise, the original
program semantics may be changed.

Given both the statically computed $DEP_{sta}$ and the dynamically computed
$DEP_{dyn}$, we compute their transitive closure to obtain $DEP = (DEP_{sta} \cup
DEP_{dyn})^*$, which represents the complete set of dependency constraints that
must be respected at all times, to ensure that the out-of-order executions examined
by our symbolic analysis are feasible.
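
The construction of DEP can be sketched in Python as below. This is a simplified illustration rather than our LLVM-based implementation: the function names, the trace format (name, kind, address), and the three-instruction trace with its static dependency are all hypothetical, and the closure is computed by naive iteration.

```python
from itertools import combinations


def explicit_deps(trace):
    """Dynamic dependencies from a trace of (name, kind, address) entries.

    Two accesses to the same address conflict (RAW, WAR, or WAW) when at
    least one of them is a store; the earlier one must stay first.
    """
    deps = set()
    for (n1, k1, a1), (n2, k2, a2) in combinations(trace, 2):
        if a1 == a2 and "store" in (k1, k2):
            deps.add((n1, n2))
    return deps


def transitive_closure(pairs):
    """Naive transitive closure of a set of (before, after) pairs."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


# Hypothetical trace: S2 stores a value computed from L1 (a static,
# implicit dependency), and L3 reads the address written by S2.
trace = [("L1", "load", 0x10), ("S2", "store", 0x20), ("L3", "load", 0x20)]
dep_sta = {("L1", "S2")}          # assumed result of static analysis
dep_dyn = explicit_deps(trace)    # {("S2", "L3")}
dep = transitive_closure(dep_sta | dep_dyn)
print(sorted(dep))  # [('L1', 'L3'), ('L1', 'S2'), ('S2', 'L3')]
```

The derived pair (L1, L3) is not visible from the addresses alone, which is why both sources of dependencies are combined before taking the closure.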

The fact that static analysis is conservative in nature will not affect the 
correctness of our subsequent symbolic analysis. Since not all memory-addressing 
instructions can be statically resolved, as shown by the example instruction 
store A[i] in Fig. 2, static analysis may soundly over-approximate the possible 
dependencies of memory-related instructions. This is not a problem because it 
guarantees that, as long as two instructions are marked as independent, it is 
always safe to reorder these instructions during out-of-order execution. This is 
crucial for ensuring that leaks detected by our method are feasible. 


4.2 Computing the Default Cache States 


Given the in-order execution trace, we perform an in-order simulation to compute 
the default cache states, which will be used during our symbolic analysis of the 
out-of-order executions. 

We regard the in-order execution trace as a sequence of instructions $T_{ino} =
\{I_1, \ldots, I_n\}$. The type of each instruction may be Load, Store, Symbolic Load,
or Symbolic Store. Each Load/Store instruction is associated with an actual
memory address. Each Symbolic Load/Store instruction is associated with a
range of addresses that it may use.

Starting with an initial cache state $S_0$, we compute the sequence of cache
states $T_{cache} = \{S_0, S_{I_1}, \ldots, S_{I_n}\}$ using the update function defined in
Section 3.2. While the update function in Section 3.2 uses the LRU replacement
policy, other cache replacement policies can also be implemented easily.

The result of in-order simulation will be given to our symbolic analysis, to
examine the set of all possible out-of-order executions. Here, an out-of-order
execution, denoted $T_{ooe} = \{I'_1, \ldots, I'_n\}$, is a permutation of instructions of the
in-order execution. That is, for all $1 \le i \le n$ and instruction $I_i \in T_{ino}$, there
exists $1 \le j \le n$ such that $I'_j \in T_{ooe}$ and $I'_j = I_i$, and vice versa.
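
To give a feeling for the number of out-of-order executions, the following Python sketch enumerates explicitly all permutations of a small window that respect a dependency relation. This is the explicit enumeration that our symbolic analysis avoids. The function name `feasible_orders` is hypothetical, and the single dependency (I1 before I4) is the explicit dependency of the running example.

```python
from itertools import permutations
from math import factorial


def feasible_orders(n, deps):
    """Yield every ordering of instructions 1..n that keeps a before b
    for each pair (a, b) in deps."""
    for order in permutations(range(1, n + 1)):
        pos = {instr: k for k, instr in enumerate(order)}
        if all(pos[a] < pos[b] for a, b in deps):
            yield order


# Window of the five instructions of Fig. 3; I1 must precede I4.
N = 5
deps = {(1, 4)}
orders = list(feasible_orders(N, deps))
print(factorial(N), len(orders))  # 120 60
```

Even with the dependency, half of the 5! = 120 permutations remain feasible, and the count grows factorially with the window size.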


5 Analyzing the Out-of-Order Executions 


In this section, we present our method for symbolically analyzing the out-of-order 
executions. 


5.1 Symbolic Encoding 


Our method uses a single logical formula ($\Phi$) to encode the behaviors of all
out-of-order executions of instructions within a sliding window of size N, together
with the condition under which an out-of-order execution has secret-dependent,
divergent cache behaviors. It guarantees that $\Phi$ is satisfiable if and only if there
exists such a side-channel leak in the sliding window of size N. Thus, when
setting the value of N, there is a trade-off between coverage and scalability.

Before explaining how $\Phi$ is constructed from the in-order execution trace,
however, we need to define the notation used in the symbolic encoding.


– Sliding Window: We focus on a sliding window of N instructions that appear
in the in-order execution trace. Within this window, instructions may be
executed in any order as long as they respect the DEP relation; outside of
this window, instructions are executed in-order.

– Program Counter: We use (N + 1) variables $PC\_I_0, PC\_I_1, \ldots, PC\_I_N$
to represent the time when we execute the N instructions $I_1, \ldots, I_N$. The
special variable $PC\_I_0$ represents the start time, and each $PC\_I_i$ (where
$1 \le i \le N$) represents the time immediately after $I_i$ is executed.

– Age of Address after Executing an Instruction: We use $Age\_addr_k\_I_i$
to represent the cache line age of a memory block at $addr_k$ after we execute
instruction $I_i$. Thus, for all memory addresses $addr_1, \ldots, addr_M$, we have
integer variables $Age\_addr_1\_I_i, \ldots, Age\_addr_M\_I_i$ for all $0 \le i \le N$.


With these notations, we define the formula $\Phi$ as a conjunction of the following
subformulas:


$$\Phi = \Phi_{pc} \wedge \Phi_{cs} \wedge \Phi_{ics} \wedge \Phi_{rep} \wedge \Phi_{dep} \wedge \Phi_{divc}$$


where $\Phi_{pc}$ is the program counter constraint, $\Phi_{cs}$ is the cache state constraint,
$\Phi_{ics}$ is the initial cache state constraint, $\Phi_{rep}$ is the cache replacement
constraint, $\Phi_{dep}$ is the dependency constraint, and $\Phi_{divc}$ is the divergence condition
constraint.


Program Counter Constraint ($\Phi_{pc}$) To get a total order of the N instructions,
we require that, for all $0 \le i \le N$, the value of $PC\_I_i$ is unique; furthermore,
we require $0 \le PC\_I_i \le N$. Thus, the constraint is defined as


$$\Phi_{pc} = \bigwedge_{0 \le i \le N} (0 \le PC\_I_i \le N) \;\wedge \bigwedge_{0 \le i,j \le N \text{ and } i \ne j} (PC\_I_i \ne PC\_I_j)$$


Cache State Constraint ($\Phi_{cs}$) Let MAX be the cache’s associativity, or the
maximal number of cache lines that can be mapped to a memory address. After
executing an instruction $I_i$, if $0 \le Age\_addr_k\_I_i < MAX$, it means the memory
block at $addr_k$ is inside the cache; but if $Age\_addr_k\_I_i = MAX$, it means the
memory block is evicted out of the cache ². Thus, the constraint is defined as


$$\Phi_{cs} = \bigwedge_{0 \le i \le N \text{ and } 0 < k \le M} (-1 \le Age\_addr_k\_I_i \le MAX)$$


² $Age\_addr_k\_I_i = -1$ means it has never been loaded to the cache yet.


Initial Cache State Constraint ($\Phi_{ics}$) Before the first instruction is executed,
the cache must be set to a proper initial state. In other words, variables
$Age\_addr_1\_I_0, \ldots, Age\_addr_M\_I_0$ must be initialized based on the default cache
states computed by in-order simulation (Section 4.2). Thus, the constraint is
defined as

$$\Phi_{ics} = \bigwedge_{0 < k \le M} (Age\_addr_k\_I_0 = init\_age\_addr_k)$$


Replacement Constraint ($\Phi_{rep}$) Assuming that instruction $I_i$ is immediately
before $I_j$ during an out-of-order execution, we define the cache line ages
after executing $I_j$ based on their ages after executing the predecessor instruction
$I_i$. Let $addr_j$ be the address used by $I_j$, $addr_{k1}$ be any address whose age
was younger than that of $addr_j$ immediately before executing $I_j$, and $addr_{k2}$ be
any address whose age was older than that of $addr_j$. According to the update
function defined in Section 3.2, we set $Age\_addr_j\_I_j$ to 0, set $Age\_addr_{k1}\_I_j$ to
$(Age\_addr_{k1}\_I_i + 1)$, and set $Age\_addr_{k2}\_I_j$ to $Age\_addr_{k2}\_I_i$. Let the relation
$UpdateRel(I_i, I_j)$ be the conjunction of the constraints defined above.

If a symbolic address (secret-dependent) is used by $I_j$, we encode it into the
update relation as follows: for each concrete address that may be instantiated
from the symbolic address, we construct an update relation UpdateRel() under
the assumption that it may be the actual address used by $I_j$.

Overall, the cache replacement constraint is defined as


$$\Phi_{rep} = \bigwedge_{0 \le i,j \le N \text{ and } i \ne j} (PC\_I_j = PC\_I_i + 1) \rightarrow UpdateRel(I_i, I_j)$$
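
The age update behind UpdateRel can be illustrated with a small concrete sketch. The function name `lru_update` and the dictionary representation are illustrative assumptions, not part of the tool; the treatment of a never-loaded block (age −1) as a miss that ages every cached line is also an assumption, since the text above only describes the younger/older cases.

```python
def lru_update(ages, accessed, max_age):
    """Return the cache line ages after accessing `accessed` (LRU policy).

    ages: dict mapping address -> age, where -1 means "never loaded" and
    max_age means "evicted". Assumption: a miss ages every cached line.
    """
    old = ages[accessed]
    # On a miss (never loaded or evicted), every cached line counts as younger.
    threshold = max_age if old == -1 else old
    updated = {}
    for addr, age in ages.items():
        if addr == accessed:
            updated[addr] = 0            # accessed block becomes the youngest
        elif 0 <= age < threshold:
            updated[addr] = age + 1      # younger blocks age by one
        else:
            updated[addr] = age          # older, evicted, or never-loaded blocks
    return updated


hit = lru_update({"a": 0, "b": 1, "c": 2}, "b", max_age=3)
miss = lru_update({"a": 0, "b": 1, "c": 2, "d": -1}, "d", max_age=3)
```

In the second call, block `c` reaches age 3 (= MAX) and is therefore considered evicted.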


Dependency Constraint ($\Phi_{dep}$) To ensure that out-of-order executions are
feasible, we enforce the relative order of any two instructions if they have
dependencies according to the DEP relation. Thus, the constraint is defined as


$$\Phi_{dep} = \bigwedge_{0 \le i,j \le N \text{ and } i \ne j \text{ and } DEP(I_i, I_j)} (PC\_I_i < PC\_I_j)$$


That is, if $I_j$ depends on $I_i$, $I_i$ must be executed before $I_j$.
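
To make the combined effect of the program counter and dependency constraints concrete, the following sketch enumerates explicitly the orders that a solver would explore symbolically. It is only an illustration: the function name `feasible_orders` is hypothetical, and the dependency pairs are those of the running example in Fig. 8.

```python
from itertools import permutations


def feasible_orders(n, dep):
    """Yield every execution order of I_1..I_n that respects `dep`.

    dep is a set of pairs (i, j) meaning I_i must execute before I_j.
    """
    for order in permutations(range(1, n + 1)):
        # pc[i] plays the role of PC_I_i: the time at which I_i executes.
        pc = {instr: time for time, instr in enumerate(order, start=1)}
        if all(pc[i] < pc[j] for i, j in dep):
            yield order


# Dependencies of the running example: I1, I2, I3 must precede I4.
orders = list(feasible_orders(5, {(1, 4), (2, 4), (3, 4)}))
```

Of the 120 permutations of five instructions, 30 satisfy these three dependencies.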


Divergent Cache Constraint ($\Phi_{divc}$) Let $Var_s$ be a symbolic (secret) variable
whose values include $v_1, v_2, \ldots$ and let $I_i$ be a symbolic instruction whose
actual addresses include $addr_{v_1}, addr_{v_2}, \ldots$ Here, the value $v_1$ corresponds to
$addr_{v_1}$ and the value $v_2$ corresponds to $addr_{v_2}$. If accessing the memory block
at $addr_{v_1}$ leads to a cache hit and accessing $addr_{v_2}$ leads to a cache miss (or
vice versa), the target instruction $I_i$ has divergent cache behaviors. Thus, the
constraint is defined as


$$\Phi_{divc} = \bigvee_{v_1, v_2} (0 \le Age\_addr_{v_1}\_I_i < MAX) \wedge (Age\_addr_{v_2}\_I_i \ge MAX)$$


Conjoining all of the subformulas defined above, we can construct the entire
formula $\Phi$, which is satisfiable (SAT) if and only if there is a side-channel leak
during one of the out-of-order executions.
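
The divergence condition can be checked on a concrete cache state as follows. This is a minimal sketch with a hypothetical helper `diverges`; it follows the formula literally, so a never-loaded block (age −1) counts as neither hit nor miss, which is an assumption of this sketch.

```python
def diverges(ages, candidates, max_age):
    """Return True if two candidate addresses of a symbolic instruction
    differ in cache behavior: one is a hit and another is a miss."""
    hits = [a for a in candidates if 0 <= ages[a] < max_age]
    misses = [a for a in candidates if ages[a] >= max_age]
    return bool(hits) and bool(misses)


# One candidate is cached (age 1) and one is evicted (age 3 = MAX).
leaky = diverges({"x": 1, "y": 3, "z": 0}, ["x", "y"], max_age=3)
# Both candidates are cached, so the access time does not depend on the secret.
safe = diverges({"x": 1, "y": 2, "z": 0}, ["x", "y"], max_age=3)
```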


5.2 The Overall Algorithm 


The overall algorithm for predictive cache analysis is shown in Algorithm 1,
which takes the in-order execution trace $T_{ino} = \{I_1, \ldots, I_n\}$, the in-order cache
state trace $T_{cache} = \{S_0, \ldots, S_n\}$, and the sliding window size N as input.
Internally, it uses a sliding window of N instructions, $T_{window}$, to generate the
SMT formula $\Phi$. For this window, $S_{init}$ is the initial cache state as computed
by in-order simulation, and $I_{target}$ is the target instruction. The formula $\Phi$ is
satisfiable if and only if an out-of-order execution of the instructions within the
window leads to divergent cache behaviors at the instruction $I_{target}$.


Algorithm 1 SYMBOLICCHECK(T_ino, T_cache, N) for predictive cache analysis.

    for pos ← 1 to (n − N) do
        first = (pos − N > 0) ? (pos − N) : 1
        T_window = T_ino[first, pos]
        I_target = T_ino[pos]
        S_init = T_cache[first − 1]
        Φ = BUILDFORMULA(T_window, I_target, S_init)
        if (SAT(Φ) == true) print LEAK_FOUND
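
The loop structure of Algorithm 1 can be sketched as below. This is not the tool's implementation: the callback `has_leak` is a hypothetical stand-in for building the formula and calling the SMT solver, and the index arithmetic simply mirrors the 1-based pseudocode.

```python
def symbolic_check(t_ino, t_cache, window_size, has_leak):
    """Slide a window over the in-order trace and collect leaking positions.

    t_ino:    list holding I_1..I_n (so t_ino[0] is I_1)
    t_cache:  list holding S_0..S_n (so t_cache[0] is S_0)
    has_leak: hypothetical callback (window, target, s_init) -> bool that
              stands in for BUILDFORMULA followed by the SAT check
    """
    leaks = []
    n = len(t_ino)
    for pos in range(1, n - window_size + 1):        # pos = 1 .. n - N
        first = pos - window_size if pos - window_size > 0 else 1
        window = t_ino[first - 1:pos]                # I_first .. I_pos
        target = t_ino[pos - 1]                      # I_pos
        s_init = t_cache[first - 1]                  # S_{first-1}
        if has_leak(window, target, s_init):
            leaks.append(pos)
    return leaks


trace = list("abcdefgh")
states = list(range(len(trace) + 1))
found = symbolic_check(trace, states, 3, lambda w, t, s: t == "d")
```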


Running Example We use the example code snippet in Fig. 2 to illustrate the 
symbolic encoding presented in this section. For this example, the in-order ex- 
ecution trace generated by our method is shown in the top half of Fig. 8. Note 
that A is marked as symbolic since A[i] is affected by the unknown variable i. 
The logical constraints are shown in the bottom half. Assume that the target 
instruction is I4, meaning that we want to construct a formula ® to check if I4 
has divergent cache behaviors. 

The program counter and cache state constraints are shown in Lines 10- 
12; recall that each program counter variable must have a unique value. The 
dependency constraints are shown in Line 13. Then, in Line 14, we show the two 
symbolic variables used to check divergent cache behaviors; their values are in 
the range of the symbolic store in Line 5. 

The update function for instruction I4 starts from Line 15. If v1 == 0x77ef5bd0,
which means 0x77ef5bd0 is used, the age after executing I4 is set to 0. The
dependency relations indicate that I5 is allowed to execute before I4. From Line
16 to 18, we show an example of the age-update constraints, together with the
program counter constraint and the condition under which Age_0x77ef5bd4_I4 would
increase by 1 from its predecessor I5 according to Section 5.1. Similarly, we encode
other predecessors of I4 for the update function in Line 19. Finally, we encode the
divergent cache constraint in Line 20.

 1  In-Order Execution Trace
 2  I1: load 0x77ef5bd0  /* A[0] */
 3  I2: load 0x77ef5bd4  /* A[1] */
 4  I3: load 0x77ef5bd8  /* A[2] */
 5  I4: symbolic store 0x77ef5bd0, 0x77ef5bd4, 0x77ef5bd8  /* A[i] */
 6  I5: load 0x77ef5bdc  /* B */
    Dependency relations: <I1,I4> <I2,I4> <I3,I4>
 7  Initialize Ages: Age_0x77ef5bd0_init == 0, Age_0x77ef5bd4_init == 0
 8                   Age_0x77ef5bd8_init == 0, Age_0x77ef5bdc_init == 0
 9  PC Constraints:  1 ≤ PC_I1:5 ≤ 5, PC_I0 == 0, distinct(PC_Ii)
10  Age Constraints: -1 ≤ Age_0x77ef5bd0_Ii ≤ 3, -1 ≤ Age_0x77ef5bd4_Ii ≤ 3
11                   -1 ≤ Age_0x77ef5bd8_Ii ≤ 3, -1 ≤ Age_0x77ef5bdc_Ii ≤ 3
12  DEP Constraints: PC_I1 < PC_I4, PC_I2 < PC_I4, PC_I3 < PC_I4, PC_I0 < PC_I1:5
13  Symbolic Var:    v1/v2 ∈ {0x77ef5bd0, 0x77ef5bd4, 0x77ef5bd8}, v1 ≠ v2
14  Update Function: v1 == 0x77ef5bd0 => Age_0x77ef5bd0_I4 == 0
15    - I4.Pred is I5: (PC_I5 + 1 == PC_I4 ∧ Age_0x77ef5bd4_I5 > Age_0x77ef5bd0_I5
16                      ∧ Age_0x77ef5bd0_I5 ≠ -1 ∧ Age_0x77ef5bd4_I5 ≠ -1)
17                      => Age_0x77ef5bd4_I4 = Age_0x77ef5bd4_I5 + 1; ......
18    - I4.Pred is I1, I2, I3: ......
19  DivC Constraint: Age_v1_I4 ≥ 3 ∧ Age_v2_I4 < 3 ∧ Age_v2_I4 ≥ 0


Fig. 8. An example encoding where the register variable i holds a secret value.


5.3 Optimizations of the Symbolic Encoding 


Without optimization, the size of the formula $\Phi$ may be as large as $O(N^2 M)$
in the worst case, where N is the number of instructions in the sliding window
and M is the number of memory addresses used inside the window. In practice,
however, many of the logical constraints can be skipped. Here, we propose two
optimization techniques.


Skipping the Infeasible Cache Update Relations While constructing the constraints
that update the cache states of the instructions, the default approach
is to assume that, for any instruction $I_j$, any other instruction $I_i$ in the same
window may be executed immediately before $I_j$. This means it must construct
$N^2$ update relations. However, due to the dependencies among instructions
captured by the DEP relation, there may be many instruction pairs $(I_i, I_j)$ such
that $I_i$ is not allowed to execute before $I_j$. By leveraging this information, we
can skip many of these update relations.
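
A possible way to prune update relations is sketched below, assuming that a pair $(I_i, I_j)$ is skipped whenever $I_j$ must (transitively) precede $I_i$. The helper name `feasible_predecessor_pairs` is hypothetical, and the sketch does not attempt further pruning (e.g., pairs that can never be adjacent because a third instruction must sit between them).

```python
def feasible_predecessor_pairs(n, dep):
    """List pairs (i, j) for which I_i may execute immediately before I_j.

    dep is a set of pairs (a, b) meaning I_a must execute before I_b.
    A pair (i, j) is dropped if I_j must precede I_i, directly or transitively.
    """
    before = set(dep)
    changed = True
    while changed:                       # naive transitive closure
        changed = False
        for a, b in list(before):
            for c, d in list(before):
                if b == c and (a, d) not in before:
                    before.add((a, d))
                    changed = True
    return [(i, j)
            for i in range(1, n + 1)
            for j in range(1, n + 1)
            if i != j and (j, i) not in before]


# Running example: I1, I2, I3 must precede I4; I5 is unconstrained.
pairs = feasible_predecessor_pairs(5, {(1, 4), (2, 4), (3, 4)})
```

For the running example, 17 of the 20 ordered pairs remain, since I4 can never be the immediate predecessor of I1, I2, or I3.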


Skipping the Unnecessary $\Phi_{divc}$ Constraints In many cases, by checking the
initial cache state with respect to the sliding window of N instructions, we may
be able to know that divergent cache behaviors are impossible during any of the
out-of-order executions. In other words, $\Phi_{divc}$ is guaranteed to be unsatisfiable
(UNSAT). Thus, we can avoid generating $\Phi$. Toward this end, we check for the
following two conditions, each of which is sufficient for $\Phi_{divc}$ to be UNSAT:


- All ages are too young: Inside the initial cache state (with respect to the
window), if all cache line ages are less than (MAX − M), where M is the
number of unique addresses used in this window, we skip checking any of the
instructions in this window for divergent cache behaviors. This is because
the cache is large enough that, regardless of the execution order, none of the
cache lines will be evicted.

- The age of addr accessed by the target instruction is too young: Inside the
initial cache state, if the age of addr is less than (MAX − M), we skip
checking this particular target instruction for divergent cache behaviors. This
is because, regardless of the value of the secret variable, this particular cache
line will never be evicted out of the cache.


Table 1. Statistics of the benchmark programs and the execution traces.

                                                                  Logged execution trace
Name       Description                                  SLOC    Length   # Store   # Load   # Addr   # Cache-line
AES        Advanced Encryption Standard                 2,077   32,069     8,753   23,316    3,126   39
DES        Archetypal block cipher                      1,090   10,162     3,994    6,168      946   48
SEED       Symmetric key block cipher                     720   20,820     6,999   13,821    2,044   30
Camellia   Symmetric key block cipher                     555   14,595     5,487    9,108     1,63   30
Chacha20   Pseudorandom function based stream cipher      263   15,739     3,668   12,071      687   34
IDEA       International Data Encryption Algorithm        288    2,920       884    2,036      318   40
ARIA       Symmetric key block cipher                   1,265   15,672     5,237   10,435    1,642   28
SM4        Symmetric key block cipher                     301   11,362     3,410    7,952    1,412   31
MD5        MD5 message-digest algorithm                   312    3,134       878    2,256      361   56
Blake2     Hash based on ChaCha stream cipher             512    4,832     1,363    3,469      309   63
SHA256     Secure Hash Algorithm standard                 825    5,900     1,302    4,598      435   64
Whirlpool  Hash designed after Square block cipher      1,100    6,941     1,915    5,026    1,257   72
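
The two sufficient conditions for skipping the divergence check can be sketched as simple predicates over the initial cache state. The function names are hypothetical; the extra requirement that ages be non-negative (i.e., the block is actually cached) is an assumption of this sketch, since the conditions above speak only of ages being "too young".

```python
def can_skip_window(init_ages, max_age, num_addrs):
    """Condition 1: every cached line is so young that executing the
    num_addrs (M) distinct addresses of the window cannot evict anything."""
    return all(0 <= age < max_age - num_addrs for age in init_ages.values())


def can_skip_target(init_ages, target_addrs, max_age, num_addrs):
    """Condition 2: every address the target instruction may access is too
    young to be evicted, whatever the value of the secret variable."""
    return all(0 <= init_ages[a] < max_age - num_addrs for a in target_addrs)


ages = {"a": 0, "b": 10, "c": 125}
skip_all = can_skip_window(ages, max_age=128, num_addrs=4)
skip_target = can_skip_target(ages, ["a", "b"], max_age=128, num_addrs=4)
```

Here the window as a whole cannot be skipped because block `c` (age 125) is not below MAX − M = 124, but a target instruction that only touches `a` or `b` can be.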


6 Experiments 


We have implemented our method in a tool named SPRECA, which builds upon 
the LLVM compiler [17] and the Z3 SMT solver [19]. Specifically, it uses LLVM to 
implement the static analysis component, which takes a C program as input and 
computes the dependencies of memory-related instructions before instrument- 
ing the LLVM bit-code; the instrumented bit-code, after compilation, is used 
to generate the execution trace at run time. We use Z3 to implement our sym- 
bolic analysis component, which takes the logged execution trace as input and 
generates SMT formulas of the cache states for leakage detection. Overall, our 
implementation includes 3.6K lines of C++ code inside LLVM for trace genera- 
tion, SMT encoding and leakage detection, as well as 0.5K lines of Python/Bash 
script code for processing the trace files and automation. The archive is available
at: https://doi.org/10.5281/zenodo.6117196.


6.1 Benchmarks 


The benchmarks used to evaluate our tool are a set of C programs from OpenSSL 
1.1.1k that implement well-known block-ciphers such as AES and DES and 
cryptographic hashing functions such as SHA256 and Whirlpool. The statistics 
of these benchmark programs are shown in Table 1, including the name of the 
program, a short description, the number of lines of C code, and statistics of the 
logged execution trace, which serves as input of our symbolic analysis method. 
For each execution trace, we show the trace length, the number of Store (ST) 
operations, the number of Load (LD) operations, the number of distinct memory 
locations touched by the execution, and the number of corresponding cache lines. 

Table 2. Results of our symbolic predictive analysis method for 8K fully associative
cache, with LRU replacement policy, and window size set to 10.

                               SMT solver calls made
Name       Trace length   total instances   SAT instances   UNSAT instances   Leaking sites   Analysis time (s)
AES        32,069                       0               0                 0               0   246.0
DES        10,162                       1               0                 1               0   620.4
SEED       20,820                     593              14               579               5   4,922.
Camellia   14,595                     366              15               351               6   2,475.
Chacha20   15,739                       0               0                 0               0   4.9
IDEA        2,920                       0               0                 0               0   1.0
ARIA       15,672                   1,060               0             1,060               0   8,760.2
SM4        11,362                      27               0                27               0   788.
MD5         3,134                       0               0                 0               0   1.2
Blake2      4,832                       0               0                 0               0   1.8
SHA256      5,900                       0               0                 0               0   2.4
Whirlpool   6,941                       0               0                 0               0   2.8


Our experiments were designed to answer the following questions:

- Is our method effective in detecting OOE-related cache side-channel leaks?
- Is our method, based on symbolic analysis, more scalable than explicit analysis?


Toward this end, for each benchmark program, we applied our symbolic analysis 
method to check if it can find OOE-related cache side-channel leaks, i.e., leaks 
that otherwise would not show up unless out-of-order execution is considered. 
To evaluate the scalability of our method, we also compared it with a baseline 
explicit analysis method. Due to space limitations, we omit the detailed algorithm
of the explicit analysis method, which systematically enumerates the same set 
of out-of-order executions of instructions considered by our symbolic analysis 
method. Thus, both our symbolic method and the explicit method examine 
the same type of secret-dependent divergent cache behaviors, but they differ in 
efficiency and scalability. 


6.2 Leakage Detection Results 


Table 2 shows the results of our symbolic analysis method. These results were 
obtained using the following parameters: the cache has a total of 8K bytes, 
divided into 128 cache lines, with 64 bytes per cache line. The cache is fully 
associative, with the LRU replacement policy. The OOE window size is set to 
10, meaning the number of Load/Store instructions that will be executed out 
of order is bounded to 10. Recall that inside the reorder buffer, there can be 
many non-memory instructions (e.g., arithmetic operations); thus, setting the 
window size to 10 is a reasonable choice. In this table, Columns 1-2 show the 
program name and the trace length. Columns 3-5 show the number of SMT solver 
calls, the number of satisfiable (SAT) instances, and the number of unsatisfiable 
(UNSAT) instances. Column 6 shows the number of leaking sites detected by 
our method and Column 7 shows the total analysis time in seconds. 
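
The cache parameters above determine how addresses map to cache lines. The following sketch works through the arithmetic; the helper names and the example addresses are illustrative assumptions.

```python
CACHE_BYTES = 8 * 1024                    # 8K-byte cache
LINE_BYTES = 64                           # 64 bytes per cache line
NUM_LINES = CACHE_BYTES // LINE_BYTES     # number of cache lines


def block_of(addr):
    """Memory block (cache-line-sized chunk) that an address falls into."""
    return addr // LINE_BYTES


def distinct_blocks(addrs):
    """Number of distinct cache lines touched by a set of addresses."""
    return len({block_of(a) for a in addrs})


touched = distinct_blocks([0x1000, 0x103F, 0x1040])
```

With these parameters `NUM_LINES` is 128, which is also the value of MAX for a fully associative cache; the three example addresses fall into two blocks.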

Note that the number of SMT solver calls may be smaller than the number
of instructions in the trace and, in many cases, is 0 because of the optimizations
implemented during our symbolic encoding: for any instruction, if our simple
checks reveal that no OOE-related divergent cache behavior is possible, we skip
the more time-consuming SMT solver call. Also note that the number of leaking
sites in Column 6, which are locations in the original C program, may be smaller
than the number of SAT instances in Column 4; this is because multiple
SAT results may be mapped to the same source code location.


Fig. 9. Comparison of the analysis time: symbolic method versus explicit method.

To confirm that the leaking sites reported in Table 2 are indeed feasible 
(5 for SEED and 6 for Camellia), we manually inspected the source code and 
the LLVM bit-code of both SEED and Camellia. Our manual inspection shows 
that the reordered sequences provided by the SMT solver are indeed feasible as 
we check them against the source code. We also find that the divergent cache 
behaviors are real in that the two concrete values computed for each symbolic 
(sensitive) variable can indeed lead to a cache hit in one case but a cache miss 
in the other case. 


6.3 Scalability Results 


To evaluate the scalability of our symbolic analysis method, we compared its
analysis time to that of the baseline explicit enumeration method. This experiment
was conducted on SEED, with the OOE window size set to 2, 4, 6, 8 and
10, respectively. We vary the window size because the computational complexity
of the problem increases exponentially as the OOE window size increases. The
results are shown in Fig. 9, where the x-axis is the OOE window size and the
y-axis is the analysis time in seconds. The blue line represents our symbolic
method while the red line represents the explicit method.

The results in Fig. 9 show that, while our symbolic method has a higher
fixed cost (associated with generating SMT formulas, calling the Z3 solver, and
interpreting the results), and thus is slower than the explicit method when the
OOE window size is small, it becomes significantly more efficient when the
window size is larger. The figure also shows that, as expected, the explicit method
has an exponential blowup (in fact, its analysis time is worse than exponential,
namely factorial in the window size), whereas the scalability of our symbolic
method is significantly better.
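
The factorial growth faced by explicit enumeration is easy to quantify. The sketch below computes the number of orderings of a window as an upper bound, ignoring dependencies (which reduce the count in practice); the variable name `orderings` is illustrative.

```python
from math import factorial

# Upper bound on the number of execution orders per window of size N,
# for the window sizes used in the scalability experiment.
orderings = {n: factorial(n) for n in (2, 4, 6, 8, 10)}

for n, count in orderings.items():
    print(f"window size {n:2d}: up to {count:,} orderings")
```

Going from a window of 8 to a window of 10 multiplies the bound by 90, which is consistent with the steep rise of the explicit method reported above.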


7 Related Work 


As we have mentioned earlier, the most closely related work is that of Guo et 
al. [10,11] which relies on KLEE to detect cache side channels. However, their 
method only treats program input as symbolic, while still explicitly enumerating 
the out-of-order executions. Unlike their method, we analyze the set of all pos- 
sible out-of-order executions symbolically by encoding them in a single logical 
formula to avoid the exponential blowup. In this sense, our method is the only 
predictive analysis method that can symbolically analyze the cache behaviors of 
out-of-order executions. 

Besides our method and the method of Guo et al. [10,11], there are many
other techniques for analyzing cache side channels. Some of them use symbolic
execution as well, e.g., to detect concurrency-related leaks [12] as well as leaks in
sequential programs [3,21,29,32]. Others use static analysis techniques including
those based on abstract interpretation [6,28,30,31]. In addition to leakage
detection, there are techniques for leakage quantification [1,2,5,7,16] as well.
However, none of these prior works considers out-of-order execution.

Beyond side-channel leakage detection and leakage quantification, cache anal- 
ysis has been used in other applications such as estimating the worst-case ex- 
ecution time (WCET) of real-time software [9, 13,25]. Beyond cache analysis, 
the idea of trace-based predictive analysis has been applied to multithreaded 
programs to detect concurrency bugs [8, 14, 22-24, 26, 27]. However, a crucial 
difference is that while concurrency bugs are violations of functional proper- 
ties of a program, our method for side-channel analysis focuses exclusively on 
non-functional properties. 


8 Conclusions 


We have presented a symbolic method for analyzing the cache behaviors of out- 
of-order executions associated with an in-order execution trace. The method 
uses static analysis to compute dependencies before instrumenting the program 
to generate the in-order execution trace. Then, it uses SMT-solver-based
symbolic analysis to analyze the cache behaviors of all out-of-order executions.
Our experiments on cryptographic software code show that the symbolic analysis
method is effective in detecting OOE-related cache side-channel leaks and
is significantly more scalable than explicit analysis. For future work, we plan
to extend our method to detect side-channel leaks caused by other CPU-level
optimizations.


Abstract. Refactoring a program without changing the program’s functional
behavior is challenging. To prevent behavioral changes from remaining
undetected, one may apply approaches that compare the functional behavior
of original and refactored programs. Difference detection approaches
often use dedicated test generators and may be inefficient (i.e., execute 
(some of) the non-modified code twice). In contrast, proving functional 
equivalence often requires expensive verification. Therefore, we propose 
PEQTEST, which aims at localized functional equivalence testing thereby 
relying on existing tests or test generators. To this end, PEQTEST derives 
a test program from the original program by replacing each code seg- 
ment being refactored with program code that encodes the equivalence of 
the original and its refactored code segment. The encoding is similar to 
program encodings used by some verification-based equivalence checkers. 
Furthermore, we prove that the test program derived by PEQTEST indeed 
checks functional equivalence. Moreover, we implemented PEQTEST in a 
prototype and evaluate it on several examples. Our evaluation shows that 
PEQTEST successfully detects refactored programs that change the pro- 
gram behavior and that it often performs better than the state-of-the-art 
equivalence checker PEQCHECK. 


1 Introduction 


Developers refactor programs [16] to improve quality attributes such as performance.
For instance, a developer may parallelize a program with OpenMP [30]
to improve performance. While a refactoring changes the program code, e.g.,
adds OpenMP pragmas, to improve the program’s quality, the changes must not
alter the program’s functional behavior. To ensure that a refactored program is
reliable, we must check that the refactoring preserves the functional behavior.
Various approaches exist that aim to safeguard refactored programs from
altered behavior. In practice, developers often perform regression testing [54], but
the success of detecting altered behavior depends on the test suite and its test
oracle(s). If refactoring rules are applied, one can prove the correctness of the applied
refactoring rules [45,22,44]. In contrast, incremental verification techniques,
e.g., [53,39,8,35], propose solutions for efficient re-verification of changed programs,
but they typically need a specification of the functional behavior, which rarely exists.
Another solution, which does not require a specification, is to inspect whether
or when the original and the refactored program are functionally equivalent.
Approaches aiming to detect differences in the behavior [26,52,46,20,31,29,36,47]
are inefficient, i.e., execute each test case on the original and the refactored
program or function, and often use dedicated test generators. Approaches aiming
to prove functional equivalence [5,56,40,14,13,43,49,41,34,4,15,17,38,23,42,19] use
heavyweight verification techniques, rarely support parallel programs, and often
consider all possible variable values.


Listing 1.1: Original program

    void sum_seq(unsigned char N) {
      int a[N+1];
      a[0] = 0;
      /* framed code segment: sequential loop (illegible in the extracted source) */
    }

Listing 1.2: Refactored program

    int sum_par(unsigned char N) {
      int a[N+1];
      a[0] = 0;
      #pragma omp parallel for
      for(int i=1; i <= N; i++)
        a[i] = (i*(i+1))/2;
    }

Listing 1.3: Generated test program

    int sum_test(unsigned char N) {
      int a[N+1];
      a[0] = 0;
      store(a, 0);
      /* framed code segment: sequential loop (illegible in the extracted source) */
      store(a, 1);
      restore(a, 0);
      #pragma omp parallel for
      for(int i=1; i <= N; i++)
        a[i] = (i*(i+1))/2;
      store(a, 2);
      eq_store(a, 1, 2);
    }

Fig. 1: Original, sequential program (top left), which initializes each array entry i
with $\sum_{j=0}^{i} j$, the refactored program (bottom left), which parallelizes the array
initialization using OpenMP and utilizing that $\sum_{j=0}^{i} j = \frac{i \cdot (i+1)}{2}$, as well as the
generated program for testing functional equivalence (right)
Our goal is to develop a lightweight, test-based approach for functional equiv- 
alence checking, for which we can use existing tests or test generators. Inspired 
by equivalence checkers [17,38,23,51,42,2,19] that transform the equivalence of 
two programs into a set of verification tasks (i.e., programs with assertions), 
our PEQTEST approach transforms the equivalence of two programs into a test 
program. To restrict equivalence testing to relevant program values and to reduce 
the duplicate execution of non-modified code, PEQTEST generates a single test 
program (verification task) that executes the unchanged code only once and 
individually checks equivalence of each refactored code segment in the context of 
the original program. The individual checks use a similar idea as UC-KLEE [38], 
which verifies equivalence of functions. More concretely, PEQTEST derives the
test program from the original program by extending each original code segment
with (a) the refactored code segment and (b) code to store, restore, and compare
the values of modified variables. To store, restore, and compare the values
of modified variables, PEQTEST relies on checkpoints, which save the values of a
given set of modified variables in a given program state.

In our example (Fig. 1), PEQTEST first detects that the original (sequential)
code segment (framed, dark blue) and the refactored (parallelized) code
segment (frameless, light blue) modify variable a¹. Thereafter, PEQTEST derives
the test program (right) from the original program (top left). It adds the
parallelized code segment. To provide the same input to the original and refactored
code segment, PEQTEST uses checkpoint 0 to store modified variables. The
test program calls store(a, 0); to save in checkpoint 0 the values of modified
variable a before the original code segment and calls restore(a, 0); to restore
the values of modified variables before the refactored code segment. To make the
result of both code segments available for equivalence checking, the test program
stores the values of modified variable a after each code segment in checkpoint 1
and 2, respectively. Finally, the equivalence test eq_store checks whether the
checkpoints 1 and 2 contain equivalent values for the modified variable a.
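
The checkpoint pattern described above can be mimicked in a few lines of Python. This is only a sketch under stated assumptions: the helpers `store`, `restore`, and `eq_store` imitate the checkpoint functions with a plain dictionary, the loops run sequentially rather than with OpenMP, and the original segment is assumed to be the running sum a[i] = a[i-1] + i, which matches the caption of Fig. 1 but is not confirmed by the listing.

```python
import copy


def store(checkpoints, state, variables, c):
    """Save the current values of `variables` in checkpoint c."""
    checkpoints[c] = {v: copy.deepcopy(state[v]) for v in variables}


def restore(checkpoints, state, variables, c):
    """Reset `variables` in `state` to the values saved in checkpoint c."""
    for v in variables:
        state[v] = copy.deepcopy(checkpoints[c][v])


def eq_store(checkpoints, variables, c1, c2):
    """True iff checkpoints c1 and c2 agree on all `variables`."""
    return all(checkpoints[c1][v] == checkpoints[c2][v] for v in variables)


def sum_seq_segment(a, n):
    # Assumed original segment: running sum a[i] = a[i-1] + i.
    for i in range(1, n + 1):
        a[i] = a[i - 1] + i


def sum_par_segment(a, n):
    # Refactored segment from the example: closed form i*(i+1)/2.
    for i in range(1, n + 1):
        a[i] = (i * (i + 1)) // 2


def sum_test(n, original=sum_seq_segment, refactored=sum_par_segment):
    checkpoints = {}
    state = {"a": [0] * (n + 1)}
    store(checkpoints, state, ["a"], 0)      # common input of both segments
    original(state["a"], n)
    store(checkpoints, state, ["a"], 1)      # result of the original segment
    restore(checkpoints, state, ["a"], 0)    # reset the input
    refactored(state["a"], n)
    store(checkpoints, state, ["a"], 2)      # result of the refactored segment
    return eq_store(checkpoints, ["a"], 1, 2)


equivalent = sum_test(10)
```

Passing a faulty refactored segment to `sum_test` makes the final comparison fail, which is the situation the generated test program is meant to expose.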

We proved that PEQTEST generates test programs that can indeed detect 
inequivalence and that if no execution of the test program reveals an inequiva- 
lence, original and refactored program are equivalent. As a proof-of-concept, we 
implemented PEQTEST and used it to check several program parallelizations 
and a few sequential refactorings. Our evaluation shows that PEQTEST reliably 
detects inequivalences and typically outperforms the state-of-the-art equivalence 
checker PEQCHECK [19]. 


2 Background 


Program Syntax. To present our approach, we rely on a simple imperative
language on integer variables.² Since synchronization issues, e.g., deadlocks,
do not affect how our approach works and we want to keep the programming
language simple, our language supports parallel execution, but no synchronization
operations. Below, we show the grammar of the programming language that we
use to present our approach.


$$S ::= E \mid v :=_\ell aexpr; \mid \mathtt{if}_\ell\ bexpr\ \mathtt{then}\ S_1\ \mathtt{else}\ S_2 \mid \mathtt{while}_\ell\ bexpr\ \mathtt{do}\ S \mid S_1 S_2 \mid [S_1 \,\|\, \cdots \,\|\, S_n]$$


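We use E to denote the empty program and assume that arithmetic expressions
aexpr in assignments and Boolean expressions bexpr in if and while statements
are built with standard operators on integers. To build more complex
programs S, several subprograms $S_i$ may be assembled into a sequence or into a
parallel statement. To unambiguously identify the original and refactored code
segments during test program generation³ and any subprogram in our proofs,
we assume that each basic statement is annotated with a label $\ell$, which must
be unique in the complete program. Moreover, we use the set $\mathcal{V}$ to refer to all
program variables and subset $\mathcal{V}(S) \subseteq \mathcal{V}$ to refer to the variables occurring in
(sub)program S. Similarly, subset $\mathcal{V}(expr) \subseteq \mathcal{V}$ represents all variables that occur
in an arithmetic or Boolean expression expr.


¹ Both segments also modify variable i, but it is a local variable, which can be ignored.

² Our implementation supports a subset of C programs, which may use OpenMP
pragmas for parallelization.

³ For our implementation, one only needs to specify the start and end of code segments.


$$\frac{\sigma(bexpr) = true}{(\mathtt{if}_\ell\ bexpr\ \mathtt{then}\ S_1\ \mathtt{else}\ S_2, \sigma, \xi) \xrightarrow{bexpr} (S_1, \sigma, \xi)}
\qquad
\frac{\sigma(bexpr) = false}{(\mathtt{if}_\ell\ bexpr\ \mathtt{then}\ S_1\ \mathtt{else}\ S_2, \sigma, \xi) \xrightarrow{\neg bexpr} (S_2, \sigma, \xi)}$$

$$\frac{\sigma(bexpr) = true}{(\mathtt{while}_\ell\ bexpr\ \mathtt{do}\ S, \sigma, \xi) \xrightarrow{bexpr} (S\ \mathtt{while}_\ell\ bexpr\ \mathtt{do}\ S, \sigma, \xi)}
\qquad
\frac{\sigma(bexpr) = false}{(\mathtt{while}_\ell\ bexpr\ \mathtt{do}\ S, \sigma, \xi) \xrightarrow{\neg bexpr} (E, \sigma, \xi)}$$

$$(v :=_\ell aexpr;, \sigma, \xi) \rightarrow (E, \sigma[v := \sigma(aexpr)], \xi)
\qquad
\frac{(S_i, \sigma, \xi) \xrightarrow{op} (S_i', \sigma', \xi')}{([S_1 \| \cdots \| S_i \| \cdots \| S_n], \sigma, \xi) \xrightarrow{op} ([S_1 \| \cdots \| S_i' \| \cdots \| S_n], \sigma', \xi')}$$

$$\frac{(S_1, \sigma, \xi) \xrightarrow{op} (S_1', \sigma', \xi')}{(S_1 S_2, \sigma, \xi) \xrightarrow{op} (S_1' S_2, \sigma', \xi')}
\qquad
\frac{\forall v \in V : \xi(\sigma(aexpr_1))(v) = \xi(\sigma(aexpr_2))(v)}{(\mathtt{eq\_store}(V, aexpr_1, aexpr_2);, \sigma, \xi) \rightarrow (E, \sigma, \xi)}$$

$$(\mathtt{restore}(V, aexpr);, \sigma, \xi) \rightarrow (E, \sigma[V \leftarrow \xi(\sigma(aexpr))], \xi)
\qquad
(E\ S, \sigma, \xi) \rightarrow (S, \sigma, \xi)$$

$$(\mathtt{store}(V, aexpr);, \sigma, \xi) \rightarrow (E, \sigma, \xi[\sigma(aexpr) := \xi(\sigma(aexpr))[V \leftarrow \sigma]])
\qquad
([E \| \cdots \| E], \sigma, \xi) \rightarrow (E, \sigma, \xi)$$


Fig. 2: Rules for operational semantics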

While the programming language above is sufficient to represent original and
refactored programs, the test programs derived by our approach also use
checkpointing to store, restore, and compare relevant parts (e.g., modified variables) of
program states. To support checkpointing and checkpoint comparison, we extend
the programming language for test programs with the three checkpoint functions
eq_store, restore, and store. All three functions get as input a subset $V \subseteq \mathcal{V}$
of relevant variables and one or two arithmetic expressions (typically an integer
constant) to refer to the relevant checkpoints.


$$S ::= \mathtt{eq\_store}(V, aexpr_1, aexpr_2); \mid \mathtt{restore}(V, aexpr); \mid \mathtt{store}(V, aexpr);$$


Program Semantics. We formalize the program semantics using a fairly
standard operational semantics that defines how a program executes. A program
execution is a sequence of transitions between execution states. An execution
state is a triple of a program, a data state, and an additional checkpoint state.
A data state is a function $\sigma : \mathcal{V} \to \mathbb{Z}$ that provides an integer value for each
program variable. We denote the set of all data states by $\Sigma$. A checkpoint state
is a function $\xi : \mathbb{N} \to \Sigma$ that maps checkpoints i to data states $\sigma$. The set $\Xi$
denotes all checkpoint states.


The 12 rules shown in Fig. 2, which consist of 7 standard rules plus 5 newly
introduced rules highlighted in light gray, define the possible transitions. As usual,
we write $\sigma(expr)$ for the evaluation of expr in data state $\sigma \in \Sigma$.⁴ The state
update $\sigma[v := \sigma(aexpr)]$, which is used in the rule for the assignment, returns a
new data state $\sigma_u$ with $\sigma_u(w) = \sigma(w)$ for all $w \in \mathcal{V} \setminus \{v\}$ and $\sigma_u(v) = \sigma(aexpr)$.
Similarly, the multi-state update $\sigma[V \leftarrow \sigma']$, which is used by the new store and
restore rules, returns a new data state $\sigma_u$ with $\sigma_u(w) = \sigma(w)$ for all $w \in \mathcal{V} \setminus V$
and $\sigma_u(v) = \sigma'(v)$ for all $v \in V$. In addition, the checkpoint update $\xi[c := \sigma_u]$,
which is used in the store rule, returns a new checkpoint state $\xi_u$ with $\xi_u(i) = \xi(i)$
for all $i \in \mathbb{N} \setminus \{c\}$ and $\xi_u(c) = \sigma_u$.⁵ Also, note that instead of assuming that $E\ S$
and $[E \| \ldots \| E]\ S$ are equivalent to S, we introduce two nop rules, which make
our proofs simpler. Having formalized the transitions, we now inductively define
the executions ex(S) of a program S with two inference rules:


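$$\frac{\sigma \in \Sigma \qquad \xi \in \Xi}{(S,\sigma,\xi) \in ex(S)} \qquad \text{and}$$

$$\frac{(S_0,\sigma_0,\xi_0) \to \dots \to (S_n,\sigma_n,\xi_n) \in ex(S) \qquad (S_n,\sigma_n,\xi_n) \to (S_{n+1},\sigma_{n+1},\xi_{n+1})}{(S_0,\sigma_0,\xi_0) \to \dots \to (S_n,\sigma_n,\xi_n) \to (S_{n+1},\sigma_{n+1},\xi_{n+1}) \in ex(S)}$$
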
We write $(S,\sigma,\xi) \to^* (S',\sigma',\xi')$ if the intermediate steps of the execution
are unimportant. Furthermore, we say that execution $(S,\sigma,\xi) \to^* (S',\sigma',\xi')$
(i) terminates normally if $S' = E$ and (ii) violates a checkpoint equivalence if
$S'$ violates a checkpoint equivalence in $(\sigma',\xi')$. In general, a program $S$ violates
a checkpoint equivalence in $(\sigma',\xi')$ if either (a) there exists a statement
$S_{eq}$ = eq_store(V, aexpr1, aexpr2); such that $\exists v \in V : \xi'(\sigma'(aexpr_1))(v) \neq
\xi'(\sigma'(aexpr_2))(v)$ and $S = S_{eq}$ or $S = S_{eq}\,S''$ or (b) $S = [S_1 || \dots || S_i || \dots || S_n]$
or $S = [S_1 || \dots || S_i || \dots || S_n]\,S''$ and there exists at least one subprogram $S_i$ that
violates a checkpoint equivalence in $(\sigma',\xi')$. In general, a program $S$ violates a
checkpoint equivalence if there exists an execution $(S,\sigma,\xi) \to^* (S',\sigma',\xi') \in ex(S)$
such that $S'$ violates a checkpoint equivalence in $(\sigma',\xi')$.

Partial Equivalence. We are interested in whether two (sub)programs behave
functionally equivalently, i.e., compute the same output when given the same
input. Like many other approaches to equivalence checking, we focus on partial
equivalence, i.e., we limit equivalence to executions that terminate normally.⁶
In addition, we utilize that checkpoint functions are not used in programs, but 
are only introduced to test functional equivalence. Therefore, our definition of 
partial equivalence focuses on data states and ignores checkpoint states. 


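Definition 1. (Sub)programs S1 and S2 are partially equivalent (S1 $\equiv$ S2) if


$$\forall \sigma, \sigma', \sigma'' \in \Sigma,\ \xi, \xi', \xi'', \xi''' \in \Xi : \big((S_1,\sigma,\xi) \to^* (E,\sigma',\xi') \in ex(S_1)
\wedge (S_2,\sigma,\xi'') \to^* (E,\sigma'',\xi''') \in ex(S_2)\big) \Rightarrow \sigma' = \sigma''.$$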
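
To make Definition 1 concrete, the following worked example compares three small subprograms. It is only a sketch: the concrete assignment syntax `v := aexpr;` and the fact that assignments leave the checkpoint state untouched are assumptions, since the full syntax and the rules of Fig. 2 are not repeated here.

```latex
% Assumed concrete syntax: an assignment is written  v := aexpr;
S_1 \;=\; x := x + 1;\ x := x + 1; \qquad
S_2 \;=\; x := x + 2; \qquad
S_3 \;=\; x := x + 3;

% For an arbitrary data state \sigma, let \sigma_{+k} denote the data state with
% \sigma_{+k}(x) = \sigma(x) + k and \sigma_{+k}(w) = \sigma(w) for all w \neq x.
(S_1,\sigma,\xi) \to^* (E,\sigma_{+2},\xi), \qquad
(S_2,\sigma,\xi'') \to^* (E,\sigma_{+2},\xi'')
\quad\Longrightarrow\quad S_1 \equiv S_2

% S_3 terminates normally in \sigma_{+3} \neq \sigma_{+2}, hence
S_1 \not\equiv S_3
```

Because the definition only constrains pairs of normally terminating executions, a subprogram without any normally terminating execution is partially equivalent to every other subprogram.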


4 Note that we do not specify the expression evaluation in detail because we have
not fixed the expression syntax. However, we assume that the result of evaluating
integer constant c in data state $\sigma$ is the constant c (i.e., $\sigma(c) = c$) and that expression
evaluation is deterministic (i.e., $\sigma(expr) = x \wedge \sigma(expr) = y \Rightarrow x = y$).

5 The store rule determines the state $\sigma_u$ using a multi state update and the index c
by evaluating an arithmetic expression (often a constant) in the current data state.

6 Note that we still may detect that a refactoring introduces non-termination because if
a refactoring introduces non-termination, our test program either detects inequivalence
or does not terminate for some inputs.


Variable Modification. To make equivalence testing more efficient, we only
want to checkpoint modified variables, i.e., the checkpoint should only store 
the value of those variables whose value may change. The following definition 
formalizes the set of variables modified by a (sub)program. 


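Definition 2. Let S be a (sub)program. The variables modified by S are:


$$M(S) := \{ v \in \mathcal{V} \mid \exists \sigma, \sigma' \in \Sigma,\ \xi, \xi' \in \Xi : (S,\sigma,\xi) \to^* (\cdot,\sigma',\xi') \wedge \sigma(v) \neq \sigma'(v) \}.$$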


For instance, in programs written in our programming language that do not
use restore statements, variables can only be modified by assignments. For those
programs, the set $M(S)$ of modified variables can be overapproximated by the
set of variables that occur in S on the left-hand side of an assignment. In the
following, we describe any overapproximation of the modified variables, e.g., the
one sketched above, by $M_{\approx} : \mathcal{S} \to 2^{\mathcal{V}}$ and assume that $M(S) \subseteq M_{\approx}(S)$.
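
The syntactic overapproximation sketched above can be illustrated in C. The miniature AST below (the `Stmt` type, the two statement kinds, and the numbering of variables) is purely hypothetical and is not the representation used by the implementation; it only shows the idea of collecting every variable that occurs on the left-hand side of an assignment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { NUM_VARS = 8 };  /* hypothetical: variables are numbered 0..NUM_VARS-1 */

typedef enum { STMT_ASSIGN, STMT_COMPOUND } StmtKind;

typedef struct Stmt {
    StmtKind kind;
    int target;                 /* STMT_ASSIGN: index of the left-hand-side variable */
    const struct Stmt *first;   /* STMT_COMPOUND: first nested statement (or NULL) */
    const struct Stmt *second;  /* STMT_COMPOUND: second nested statement (or NULL) */
} Stmt;

/* Marks every variable that syntactically occurs on the left-hand side of an
 * assignment in s. Entries that are already true are left untouched, so the
 * result for several subprograms can be accumulated in one array. */
void modified_overapprox(const Stmt *s, bool modified[NUM_VARS]) {
    if (s == NULL) {
        return;
    }
    if (s->kind == STMT_ASSIGN) {
        assert(s->target >= 0 && s->target < NUM_VARS);
        modified[s->target] = true;
        return;
    }
    /* Compound statement (sequence, branch, loop, ...): visit all nested parts. */
    modified_overapprox(s->first, modified);
    modified_overapprox(s->second, modified);
}
```

The result is an overapproximation because an assignment is counted even if it is never executed or always writes the value the variable already has.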


3 Generating Test Programs with PEQtest 


Our goal is to test equivalence between an original and a refactored program,
neither of which uses checkpoint functions. As explained earlier, checkpoint functions
are supposed to be used by test programs only. In this section, we describe how 
PEQTEST generates the test program for equivalence testing, prove soundness of 
the generated test program, i.e., show that the generated test program checks 
functional equivalence, and discuss limitations of PEQTEST’s program generation 
as well as our implementation. 

Sound Test Program Generation. To test functional equivalence of two 
subprograms, the idea of our PEQTEST approach is to execute both subprograms 
with the same input and compare their outputs. The test program generated 
by PEQTEST will execute the two subprograms sequentially so that their
executions cannot interfere with each other. Furthermore, it will ensure that both
subprograms get equal inputs, which may be produced by the (original) program, 
and that their outputs can be compared. Many verification approaches for 
functional equivalence [17,38,23,51,42,2,19] use a similar setup, but do not restrict 
the inputs. To ensure equal inputs and make outputs available, these approaches 
either (1) duplicate (shared, modified) variables, replace the variables in one of 
the subprograms by the duplicated ones, and assign equal values to the original 
and duplicated inputs [17,42,2,19], (2) add additional variables to store the 
input and output values and restore the input after the execution of the first 
subprogram [51,23], or (3) use dedicated functions, e.g., checkpoint functions, 
to store and restore inputs and outputs [38]. For our test program, we choose
option (3) because it does not change the subprograms and, thus, simplifies test
program generation and makes the test program easier to comprehend.

Next, we discuss how we implement option (3). To lower the test effort, we 
decide to only store and compare values of variables that may be modified by 
one of the two subprograms. Since this set cannot always be determined precisely 
and different overapproximations are imaginable, we use parameter V to provide
this set to the test program generator. Moreover, we aim at localized equivalence
testing. Thus, our test program likely includes more than one functional
equivalence test, namely one for each pair of original and refactored subprogram. 
While the output must be stored directly after the execution of each subprogram, 
the output comparison can be done at the end of the test program or after the 
execution of original and refactored subprogram. We choose the second option 
because it allows us to reuse checkpoints and lets the test program stop at the first 
difference of outputs, which makes it easier to detect which pair of subprograms 
is responsible for the failure of the test program, i.e., which pair of subprograms 
is inequivalent. We stop at the first difference instead of, e.g., logging the difference
because test execution becomes faster, but we address the logging alternative
when discussing the limitations. The following definition shows how we encode
the functional equivalence test for an original subprogram S1 and the refactored
subprogram S2 for a given overapproximation V of the set of modified variables.


test_eq(V, S1, S2) :=
store(V, 0); S1 store(V, 1); restore(V, 0); S2 store(V, 2); eq_store(V, 1, 2);
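
The encoding above can be mimicked in plain C. The following is a minimal sketch under several assumptions that are not taken from the implementation: data states are fixed-size arrays, subprograms are function pointers, checkpoints live in an in-memory array, and `eq_store` returns a flag instead of stopping the execution. All type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { NUM_VARS = 4, NUM_CHECKPOINTS = 3 };

typedef struct { long value[NUM_VARS]; } DataState;                   /* sigma */
typedef struct { DataState slot[NUM_CHECKPOINTS]; } CheckpointState;  /* xi */
typedef void (*Subprogram)(DataState *);

/* store(V, c): copy the variables in V from the data state into checkpoint c. */
static void store(const int *vars, size_t n, int c,
                  const DataState *s, CheckpointState *cp) {
    for (size_t i = 0; i < n; i++) {
        cp->slot[c].value[vars[i]] = s->value[vars[i]];
    }
}

/* restore(V, c): copy the variables in V from checkpoint c back into the data state. */
static void restore(const int *vars, size_t n, int c,
                    DataState *s, const CheckpointState *cp) {
    for (size_t i = 0; i < n; i++) {
        s->value[vars[i]] = cp->slot[c].value[vars[i]];
    }
}

/* eq_store(V, c1, c2): true iff checkpoints c1 and c2 agree on all variables in V. */
static bool eq_store(const int *vars, size_t n, int c1, int c2,
                     const CheckpointState *cp) {
    for (size_t i = 0; i < n; i++) {
        if (cp->slot[c1].value[vars[i]] != cp->slot[c2].value[vars[i]]) {
            return false;
        }
    }
    return true;
}

/* test_eq(V, S1, S2): runs S1 and S2 on the same input, compares their outputs on V. */
bool test_eq(const int *vars, size_t n, Subprogram s1, Subprogram s2,
             DataState *state) {
    CheckpointState cp = {0};
    store(vars, n, 0, state, &cp);    /* remember the common input */
    s1(state);
    store(vars, n, 1, state, &cp);    /* output of S1 */
    restore(vars, n, 0, state, &cp);  /* give S2 the same input */
    s2(state);
    store(vars, n, 2, state, &cp);    /* output of S2 */
    return eq_store(vars, n, 1, 2, &cp);
}

/* Example subprograms operating on variable 0. */
void inc_twice(DataState *s) { s->value[0] += 1; s->value[0] += 1; }
void add_two(DataState *s)   { s->value[0] += 2; }
void add_three(DataState *s) { s->value[0] += 3; }
```

In this sketch, `test_eq` with `inc_twice` and `add_two` returns true for every input, whereas `inc_twice` and `add_three` differ on variable 0 and yield false. After the call, the data state holds the output of the second subprogram, just as in the formal encoding.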


Next, we show that our test encoding is sound, i.e., it may detect inequivalences
if the two subprograms S1 and S2 are inequivalent. Our encoding uses checkpoint
equivalence to detect whether two subprograms S1 and S2 are inequivalent, i.e.,
differ in their outputs. Hence, it must violate a checkpoint equivalence if S1 and
S2 are inequivalent. We can ensure even more and show that the test encoding
is also complete. As shown by the following theorem, our test encoding violates a
checkpoint equivalence if and only if S1 and S2 are inequivalent.


Theorem 1. Let S1 and S2 be (sub)programs without calls to checkpoint functions
and $M_{\approx}$ be an overapproximation of the modified variables. Then, S1 $\equiv$ S2
iff test_eq($M_{\approx}$(S1) $\cup$ $M_{\approx}$(S2), S1, S2) does not violate a checkpoint equivalence.


Proof (Sketch). Let $M := M_{\approx}(S1) \cup M_{\approx}(S2)$.

$\Rightarrow$ Let (test_eq(M, S1, S2), $\sigma$, $\xi$) $\to^*$ (eq_store(M, 1, 2);, $\sigma_5$, $\xi_5$) be arbitrary.
Show with semantics that there exists an execution
(test_eq(M, S1, S2), $\sigma$, $\xi$)
$\to$ (S1 store(M, 1); restore(M, 0); S2 store(M, 2); eq_store(M, 1, 2);, $\sigma_1$, $\xi_1$)
$\to^*$ (store(M, 1); restore(M, 0); S2 store(M, 2); eq_store(M, 1, 2);, $\sigma_2$, $\xi_2$)
$\to^*$ (S2 store(M, 2); eq_store(M, 1, 2);, $\sigma_3$, $\xi_3$)
$\to^*$ (store(M, 2); eq_store(M, 1, 2);, $\sigma_4$, $\xi_4$)
$\to$ (eq_store(M, 1, 2);, $\sigma_5$, $\xi_5$)
with $\sigma = \sigma_1 = \sigma_3$, for all $v \in \mathcal{V} \setminus M$ also $\sigma(v) = \sigma_5(v)$, and for all $v \in M$ we
have $\xi_5(1)(v) = \sigma_2(v)$ and $\xi_5(2)(v) = \sigma_4(v)$.
Conclude that there exist (S1, $\sigma_1$, $\xi_1$) $\to^*$ (E, $\sigma_2$, $\xi_2$) and (S2, $\sigma_3$, $\xi_3$) $\to^*$ (E, $\sigma_4$, $\xi_4$)
with $\sigma = \sigma_1 = \sigma_3$ and for all $v \in M$ we have $\xi_5(1)(v) = \sigma_2(v)$ and $\xi_5(2)(v) = \sigma_4(v)$. By
assumption (S1 $\equiv$ S2), $\sigma_2 = \sigma_4$ and, thus, $\xi_5(1)(v) = \sigma_2(v) = \sigma_4(v) = \xi_5(2)(v)$.
By semantics, (test_eq(M, S1, S2), $\sigma$, $\xi$) $\to^*$ (eq_store(M, 1, 2);, $\sigma_5$, $\xi_5$) does not
violate a checkpoint equivalence.

$\Leftarrow$ Let (S1, $\sigma_1$, $\xi_1$) $\to^*$ (E, $\sigma_2$, $\xi_2$) and (S2, $\sigma_3$, $\xi_3$) $\to^*$ (E, $\sigma_4$, $\xi_4$) be arbitrary
with $\sigma_1 = \sigma_3$. Show with semantics that there exists an execution
(test_eq(M, S1, S2), $\sigma$, $\xi$)
$\to$ (S1 store(M, 1); restore(M, 0); S2 store(M, 2); eq_store(M, 1, 2);, $\sigma_1$, $\xi_1$)
$\to^*$ (store(M, 1); restore(M, 0); S2 store(M, 2); eq_store(M, 1, 2);, $\sigma_2$, $\xi_2$)
$\to^*$ (S2 store(M, 2); eq_store(M, 1, 2);, $\sigma_3$, $\xi_3$)
$\to^*$ (store(M, 2); eq_store(M, 1, 2);, $\sigma_4$, $\xi_4$)
$\to$ (eq_store(M, 1, 2);, $\sigma_5$, $\xi_5$)
with $\sigma = \sigma_1 = \sigma_3$, for all $v \in \mathcal{V} \setminus M$ also $\sigma(v) = \sigma_5(v)$, and for all $v \in M$ we
have $\xi_5(1)(v) = \sigma_2(v)$ and $\xi_5(2)(v) = \sigma_4(v)$.

Since the test program does not violate a checkpoint equivalence, for all $v \in M$
we know $\sigma_2(v) = \xi_5(1)(v) = \xi_5(2)(v) = \sigma_4(v)$. We conclude that $\sigma_2 = \sigma_4$.


So far, we can use the test encoding to test or even verify functional equivalence 
of complete programs. Following the idea of PEQCHECK [19], which checks 
equivalence on the level of subprograms rather than on the level of functions or 
programs, our goal is to split testing of equivalence into multiple subtests, namely 
one subtest per pair of original and refactored subprogram. While PEQCHECK 
builds one equivalence task per pair and verifies all tasks on every input, our 
PEQTEST approach generates one single test program that only provides inputs 
produced by the original program.⁷ More concretely, PEQTEST derives the test
program from the original program by replacing the subprograms being refactored 
with the test encoding test_eq of the original and refactored subprogram. 

Currently, we assume that PEQTEST is informed about the refactored subprograms.
More concretely, given original program S and refactored program S’, we
assume that there exists a partial, injective replacement function $\gamma : \mathcal{S} \rightharpoonup \mathcal{S}$⁸
such that S’ can be derived from S by replacing all subprograms S1 of S
with S1 $\in$ preImg($\gamma$) by $\gamma$(S1). Generally, we write S2 = $\Gamma$(S1, $\gamma$) to denote
that S2 is derivable from S1 by replacing all subprograms $S_s$ of S1 by
$\gamma(S_s)$. For the PEQTEST approach, we assume that the replacement function $\gamma$
only describes the refactoring of the original program S, i.e., preImg($\gamma$) only
contains subprograms of S. In addition, the replacement must be unambiguous.
Hence, we do not allow S1, S2 $\in$ preImg($\gamma$) such that S2 is a subprogram
of S1 nor S1, S1 S2 $\in$ img($\gamma$) such that S1 is a subprogram of S and
S1 $\notin$ preImg($\gamma$).⁹ We also require that E, [E||...||E] $\notin$ (preImg($\gamma$) $\cup$ img($\gamma$))
and $\nexists S$ : E S, [E||...||E] S $\in$ (preImg($\gamma$) $\cup$ img($\gamma$)) because they are not proper
programs. To prevent interference of parallel statements from invalidating the
result of a test, subprograms in preImg($\gamma$) (img($\gamma$)) must not occur in a
parallel statement of the original (refactored) program. Thus, a refactoring in a
parallel statement must be described by a refactoring of the parallel statement.
Note that for proper programs one can always use $\gamma$ = {(S, S’)}.¹⁰

To generate our test program, PEQTEST requires a replacement function $\gamma_{test}$
that maps the subprograms being refactored to their test encodings. PEQTEST


7 If all original and refactored subprograms are equivalent (which we aim to inspect),
the original and refactored program will provide the same inputs.

8 If $\gamma$ is not injective, one can make it injective by properly changing statement labels.

9 One can achieve this by proper choices of code segments and statement labels.

10 However, one may need to adapt some of the labels in S’.
derives $\gamma_{test}$ from the replacement function $\gamma$, which describes the refactoring.
For each subprogram in the domain, PEQTEST replaces its image (the refactored
subprogram) by the test encoding of that subprogram and its refactored
subprogram, thereby using an $M_{\approx}$ to determine the set of modified variables.


$$\gamma_{test}(\gamma, M_{\approx}) := \{ (S_1, test\_eq(M_{\approx}(S_1) \cup M_{\approx}(\gamma(S_1)), S_1, \gamma(S_1))) \mid S_1 \in preImg(\gamma) \}$$


Let us briefly discuss why $\gamma_{test}$ fulfills the requirements on a replacement function.
Since the test encoding contains $\gamma$(S1), function $\gamma_{test}$ inherits injectivity
from $\gamma$. By construction, test encodings are unequal to E, E S, [E||...||E], and
[E||...||E] S and start with checkpoint functions, which we assume that the original
program does not contain. The remaining requirements are fulfilled because
we only replace refactored subprograms by the corresponding test encoding.
Now, we have everything at hand to generate the test program, which can then
be used to detect inequivalences with an existing test approach, e.g., [12,1]. As
explained, we derive the test program from the original program by replacing the
subprograms being refactored with the test encoding test_eq of the original and
refactored subprogram. To achieve this, we use the replacement function $\gamma_{test}$.


$$test\_prog(S, \gamma, M_{\approx}) := \Gamma(S, \gamma_{test}(\gamma, M_{\approx}))$$


Again, let us consider soundness, but now for the test program. Our goal is 
to detect inequivalences caused by a refactoring. Thus, we do not give any 
guarantees if the original program is non-deterministic, i.e., not equivalent to 
itself, which can only occur if it contains non-deterministic parallel statements 
or checkpoint functions. We already assumed that checkpoint functions are only 
used by the test program, but not by the original or refactored program. For our 
soundness discussion, we also exclude programs that contain non-replaced,
non-deterministic parallel statements. More concretely, we assume that all parallel
statements $S_p$ that are not replaced, i.e., for which there does not exist a
subprogram $S_s \in preImg(\gamma)$ such that $S_p = S_s$ or $S_p$ is a subprogram of $S_s$, are
deterministic ($S_p \equiv S_p$). In this case, the following theorem ensures that our
PEQTEST approach can soundly detect inequivalences, i.e., the test program 
generated by PEQTEST is able to detect a violation of a checkpoint equivalence 
if original and refactored program are inequivalent. 


Theorem 2. Let S and S’ be programs without calls to checkpoint functions,
$M_{\approx}$ an overapproximation of the modified variables, $\gamma$ be a replacement function
such that S’ = $\Gamma$(S, $\gamma$), and all non-replaced parallel statements $S_p$ of S are
deterministic ($S_p \equiv S_p$). If S $\not\equiv$ S’, then there exists $(S_0, \sigma_0, \xi_0) \to^* (S_n, \sigma_n, \xi_n) \in
ex(test\_prog(S, \gamma, M_{\approx}))$ that violates a checkpoint equivalence.


Finally, let us look at the contraposition of the above theorem. While our 
intention for PEQTEST is testing and detection of equivalence violations, the 
corollary below states that we can alternatively verify the test program generated 
by PEQTEST to show functional equivalence.

[Listing 1.4: Original program. Listing 1.5: Refactored program.]

Fig. 3: Behaviorally equivalent original and refactored program whose code
segments are not equivalent


Corollary 1. Let S and S’ be programs without calls to checkpoint functions,
$M_{\approx}$ an overapproximation of the modified variables, $\gamma$ be a replacement function
such that S’ = $\Gamma$(S, $\gamma$), and all non-replaced parallel statements $S_p$ of
S are deterministic ($S_p \equiv S_p$). If no execution $(S_0, \sigma_0, \xi_0) \to^* (S_n, \sigma_n, \xi_n) \in
ex(test\_prog(S, \gamma, M_{\approx}))$ violates a checkpoint equivalence, then S $\equiv$ S’.


Discussion of Limitations. Functional equivalence of two programs is
undecidable [17]. While our PEQTEST approach is sound under certain assumptions,
it may report violations of checkpoint equivalences although the original
and refactored program are equivalent. Hence, it may be incomplete. One reason 
is the wrong choice of code segments. For example, consider Fig. 3. Although 
the two code segments of original and refactored program (highlighted in blue 
and green, respectively) are inequivalent, the programs are equivalent. For our 
experiments, we ensured that we do not make the wrong choice for the code 
segments. In practice, one may check whether a reported violation is a false alarm 
caused by a wrong choice of code segments by reusing the test input causing 
the violation to execute one or more test programs generated by PEQTEST that 
use the same original and refactored program but larger segments, e.g., using 
segments on function or program level, or iteratively merging segments until the 
violation is disproved or the segments become the programs. 

Next, let us discuss the assumption used in Theorem 2. One can easily get rid 
of the assumption that non-replaced parallel statements must be deterministic. 
Basically, PEQTEST needs to extend $\gamma$ with pairs ($S_p$, $S_p$) for all non-replaced
parallel statements Sp. Supporting checkpoint functions is more challenging 
because PEQTEST must be able to store and restore checkpoints and it must 
ensure that its checkpoints and the program’s checkpoints do not interfere. While 
one may find such an encoding, our definition of partial equivalence does not 
cover checkpoint states. Also, it does not support non-deterministic programs 
since our main motivation for PEQTEST is refactoring or parallelization of
sequential programs, not the refactoring of non-deterministic, parallel programs. To
properly support checkpointing and all kinds of parallel programs, our definition 
of equivalence and PEQTEST need to be adapted significantly. 

Also, the requirements on the replacement function restrict our PEQTEST
approach. While many assumptions can be met by adapting labels of statements,
the requirements that code segments must be subprograms and must not
occur in a parallel statement are major restrictions. However, note that this only
limits the granularity of code segments, but not the applicability of the approach.

Finally, we want to mention that in our above formalization we chose to stop
as soon as PEQTEST finds a violation because it simplified our proofs. To always
inspect all refactored code segments, one can either move PEQTEST’s checks to
the end of the test program and use different checkpoints per test encoding, or
only write a log without stopping when a difference is detected. To ensure that one
still tests on values of the original program, one must restore the output of the
original program at the end of each test encoding or swap S1 and S2 in the test
encoding test_eq, i.e., execute the refactored subprogram S2 before the original
subprogram S1. Our current implementation postpones PEQTEST’s checks to
the end of the test program and restores the output of the original program at
the end of each test encoding.

Implementation. We support test program generation for a subset of C pro- 
grams with or without OpenMP directives. So far, we do not support programs 
with pointer aliasing (except for parameter passing). While we allow pointers 
and dynamic memory allocation, we do not support the modification of dynamic 
data structures in original or refactored code segments. The reason is that we
checkpoint arrays and structs by recursively checkpointing their elements and
checkpoint pointers by dereferencing them and then checkpointing the dereferenced
non-pointer element. Thus, our current implementation only works correctly
if pointers that need to be checkpointed are non-null and do not change
in original or refactored code segments.¹¹

Our test program generation relies on the ROSE compiler framework [37]. To 
store and restore checkpoints, we use a minicpr library, but we built our own 
library to compare checkpoints. Our implementation assumes that the start and 
end of a code segment i is specified by pragma statements #pragma scope_i and 
#pragma epocs_i. Currently, we insert them manually. For OpenMP paralleliza- 
tion (our main field of application), insertion is mostly straightforward. Often, 
choosing the code blocks associated with the outermost OpenMP directives is a 
good choice. This can easily be automated, but has not been implemented yet. 
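
As an illustration of the pragma convention described above, the following sketch marks code segment 1 around an OpenMP-parallelized loop. The function `sum_array` is a made-up example and not taken from the benchmark; only the pragma names `scope_1` and `epocs_1` follow the text. Compilers that do not know these pragmas ignore them (possibly with a warning), and without OpenMP support the loop simply runs sequentially.

```c
#include <assert.h>
#include <stddef.h>

/* Sums the first n entries of a. The loop between the two segment pragmas
 * is the code block associated with the outermost OpenMP directive. */
long sum_array(const int *a, size_t n) {
    assert(a != NULL || n == 0);
    long sum = 0;
#pragma scope_1
#pragma omp parallel for reduction(+ : sum)
    for (size_t i = 0; i < n; i++) {
        sum += a[i];
    }
#pragma epocs_1
    return sum;
}
```

Here `sum` is the only variable that is modified by the segment and visible afterwards, so it is the natural candidate for checkpointing.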

For each code segment, our implementation runs ROSE’s definition-use analysis
to detect the modified variables $M_{\approx}$ that are visible after the code segment. If
a code segment contains procedure calls, we also add all global variables and all
variables occurring in the parameter expression of a pointer or array argument to
the modified variables $M_{\approx}$. Based on the computed set $M_{\approx}$ of modified variables,
we then extend the sequential code segment with the refactored code segment
and the calls to the checkpoint library necessary to store and restore checkpoints.
In contrast to our formalization, the store and restore operations only get the
checkpoint name, while additional calls are used to inform the checkpoint library
which variables V to consider. Also note that the test program generated by our
implementation stores the output of the original and refactored code segments


11 Due to internals of the used checkpoint library, pointers must not change after they
are first checkpointed.
in checkpoints that differ for each execution of a test encoding and performs
output comparison at the end of the test program, which allows us to inspect all
checkpoints at once and to possibly find multiple violations.

Next, we describe the checkpoint comparison. For each variable in the two
checkpoints¹², we check whether their content is equivalent. Except for floating
point values, we rely on C’s byte level comparison function memcmp. Often,
implementations of floating point operations like + are not associative, but small
differences of floating point values are tolerable. Thus, our comparison of floating
point values succeeds when the difference of the values is within a tolerance $\epsilon$¹³.
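
The two comparison modes can be sketched in C as follows. This is not the comparison library of the implementation: the function names are hypothetical, and the sketch assumes that the tolerance bounds the absolute difference of the two values, which the text does not state explicitly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Byte-level comparison for non-floating-point data, based on memcmp. */
bool bytes_equal(const void *a, const void *b, size_t size) {
    return memcmp(a, b, size) == 0;
}

/* Tolerance-based comparison for floating point data: succeeds if every pair
 * of values differs by at most eps (assumed: absolute difference). */
bool doubles_equal(const double *a, const double *b, size_t count, double eps) {
    assert(eps >= 0.0);
    for (size_t i = 0; i < count; i++) {
        double diff = a[i] - b[i];
        if (diff < 0.0) {
            diff = -diff;
        }
        if (!(diff <= eps)) {  /* written this way so that NaN counts as a mismatch */
            return false;
        }
    }
    return true;
}
```

With `eps = 1e-8`, the value used in the evaluation, small deviations caused by a different summation order are accepted, while a byte-level comparison of the same values could fail.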


4 Evaluation 


The goals of our experiments are to (a) study how effective and efficient
PEQTEST’s detection of inequivalences is and to (b) compare PEQTEST to an existing
equivalence checker. For our comparison, we choose PEQCHECK because it also 
supports localized checking for OpenMP programs. 


4.1 Experimental Setup 


Benchmark. To check equivalence of sequential and parallelized programs, we use 
the tasks from the DataRaceBench (DRB) benchmark suite [24,50] (version 1.3.2), 
which addresses common mistakes in OpenMP parallelization and contains 
OpenMP programs with and without data races. From the DataRaceBench, we 
exclude all tasks with threadprivate directives, which we cannot cover with
our segments, and all tasks that require at least an OpenMP 4.5 compiler or
that offload computation to a different device (i.e., use the target construct)
because they are supported by neither PEQTEST nor PEQCHECK. In total, we
get 132 tasks (26 equivalent and 106 inequivalent tasks). We manually selected the 
code segments following the idea discussed in the implementation paragraph and 
use the DataRaceBench programs without OpenMP constructs for the sequential 
(original) programs. To execute the generated test programs, we use the inputs 
provided by DataRaceBench. 

To check equivalence of two sequential program versions, we consider all 
non-recursive programs from Rêve [15]. However, we exclude loop4 and loop5,
which were not available, as well as digits10, digits!10, and barthe2, which declare 
different sets of output variables in original and refactored program and, thus, 
are detected inequivalent during test program generation. To make the programs 
executable, we remove the mark annotations, which have no implementation, and 
extend each of the programs with a test driver that generates random inputs. 
The code segments are the same as in the evaluation of PEQCHECK [19]. In total, 
we get 15 sequential tasks (5 equivalent and 10 inequivalent tasks). 

Tool Configurations. To study the trade-off between effectiveness and 
efficiency, we examine three PEQTEST configurations, which differ in the resources 


12 By construction, checkpoints that are compared store the same variables.
13 In our evaluation, we use $\epsilon = 10^{-8}$.
used during test program execution. The low effort configuration uses one thread
and runs the test program once. The other two configurations use two threads
for the DRB tasks and one thread for the sequential tasks while running the
test program 10 and 100 times, respectively. For the competitor PEQCHECK [19], we use a
setup similar to [19]. For the DRB tasks, PEQCHECK combines the PEQCHECK
encoding¹⁴ (revision 9dc36b) and verifier CIVL [42] (version 1.20_5259) using the
theorem prover Z3 [27] (version 4.8.7). We restrict CIVL to two threads, set its
timeout to 5 min, and disable the division by zero and memory leakage checks.
For the sequential tasks, PEQCHECK combines the PEQCHECK encoding with
verifier CPACHECKER [7] (version 2.0). For verification, we use CPACHECKER’s
default analysis, which is also limited to 5 min.

Environment. We use a time limit of 5 min per task and run our experiments 
on an Ubuntu 20.04 machine with an Intel Core i7 (1.8 GHz) and 32 GB of RAM. 


4.2 Experiments 


RQ 1: How effective is PEQtest with minimal resources? To answer this 
research question, we look at PEQTEST’s results for the low effort configuration 
(1 thread, 1 run). For the DataRaceBench (DRB) tasks (left) and the sequential
(SEQ) tasks (right), Tab. 1 shows for all three PEQTEST configurations the
absolute and relative number¹⁵ of correctly detected inequivalences, the number
of missed inequivalences, i.e., inequivalences that are not detected, the number of 
equivalent tasks for which an inequivalence is incorrectly detected (i.e., the false 
alarms), and the number of equivalent tasks for which no inequivalence is detected. 
For the two classes in which no inequivalence is detected (missed inequivalence or 
correctly detected no inequivalence), we also distinguish between the two reasons 
for not detecting inequivalences: (1) no inequivalences are reported during test 
program execution and (2) task not completed, e.g., test program generation 
failed or a timeout occurred during test program generation or execution. 
Looking at the first two columns of the DRB tasks and the two columns of 
the SEQ tasks in Tab. 1, which show the results of the low effort configuration, 
we observe that for our examples PEQTEST does not report any false alarms, 
i.e., the number of incorrectly detected inequivalences is zero. Thus, we have 
100% precision for inequivalence detection. More surprisingly, PEQTEST detects 
more than half of the inequivalences (i.e., recall > 50%) with its low effort 
configuration and, thus, without parallel execution in case of the parallelized 
DRB tasks. Studying the detected inequivalences, we observe that almost all the 
detected inequivalent DRB tasks use a variable to which data-sharing attribute 
(first)private is assigned and that is visible, but typically not live after the 
parallelized code segment. The data-sharing attribute makes the variable
thread-local during execution of the parallelized code segment and prevents the
thread-local variable values from becoming available after the parallelized code segment.


14 https://git.rwth-aachen.de/svpsys-sw/FECheck
15 The relative numbers are the absolute numbers divided by the total number of
equivalent and inequivalent tasks, respectively.
Table 1: For each of the three PEQTEST configurations, the table shows for the DRB
and sequential (SEQ) tasks the absolute and relative number of tasks for which
inequivalence is detected correctly, is missed, is detected incorrectly, and is
correctly not detected. If no inequivalence is detected, the table also distinguishes
between no inequivalence reported (i.e., no inequivalence observed in runs) and
task not completed due to a timeout or failure.


                                       DRB tasks                                SEQ tasks
                                       1 thread     2 threads    2 threads      1 thread
                                       1 run        10 runs      100 runs       1/10/100 runs
correctly detected inequivalence       58 (55 %)    72 (68 %)    74 (70 %)      6 (60 %)
missed inequivalence                   48 (45 %)    34 (32 %)    32 (30 %)      4 (40 %)
  no inequivalence reported            38 (35 %)    24 (22 %)    22 (17 %)      3 (30 %)
  task not completed                   10 (10 %)    10 (10 %)    10 (10 %)      1 (10 %)
incorrectly reported inequivalence      0 (0 %)      0 (0 %)      0 (0 %)       0 (0 %)
correctly detected no inequivalence    26 (100 %)   26 (100 %)   26 (100 %)     5 (100 %)
  no inequivalence reported            22 (85 %)    22 (85 %)    22 (85 %)      4 (80 %)
  task not completed                    4 (15 %)     4 (15 %)     4 (15 %)      1 (20 %)


Furthermore, many of the detected inequivalent sequential tasks are inequivalent 
for many different input values. We conclude that inequivalences caused by the 
discussed data-sharing attributes or input-insensitive inequivalences can easily 
be detected with a single run and thread. 
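
The precision and recall figures quoted above follow directly from the counts in Tab. 1. The small sketch below recomputes them; the struct and function names are illustrative only.

```c
#include <assert.h>

typedef struct {
    int detected;      /* inequivalent tasks reported as inequivalent */
    int missed;        /* inequivalent tasks for which nothing was reported */
    int false_alarms;  /* equivalent tasks reported as inequivalent */
} Outcome;

/* Recall: share of the inequivalent tasks that were detected. */
double recall(Outcome o) {
    assert(o.detected + o.missed > 0);
    return (double)o.detected / (double)(o.detected + o.missed);
}

/* Precision: share of the reported inequivalences that are real. */
double precision(Outcome o) {
    assert(o.detected + o.false_alarms > 0);
    return (double)o.detected / (double)(o.detected + o.false_alarms);
}
```

For the low effort configuration, the DRB counts (58 detected, 48 missed, 0 false alarms) give a recall of 58/106, roughly 55 %, and the SEQ counts (6 detected, 4 missed, 0 false alarms) give 60 %; the precision is 100 % in both cases.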


RQ 2: Does PEQtest’s effectiveness increase when given more 
resources and what are the costs? First, we examine whether PEQTEST 
performs better if we increase the resources for testing, i.e., the number of 
runs and for parallelized programs also the number of threads used during test 
program execution. Comparing the results of our three PEQTEST configurations 
(Tab. 1), we observe that there is no difference for the sequential tasks. The 
reason is that one can only detect the missed inequivalences with particular 
inputs whose random generation is unlikely. For the DRB tasks, however, the 
number of correctly detected inequivalences increases and the number of missed 
inequivalences decreases when providing more resources. All other entries stay 
the same. Hence, PEQTEST’s effectiveness may increase (i.e., its recall increases)
when we allow it to use more resources. In particular, using more than one thread
for parallelized programs increases the effectiveness significantly, as one could
have expected. For our examples, using 100 instead of 10 runs hardly improves
PEQTEST’s effectiveness. In general, PEQTEST misses inequivalences in the
DRB tasks if the generation of the test program fails (10 tasks). In addition, it
misses inequivalences for SIMD constructs (2 tasks), inequivalences depending on
thread scheduling (13 tasks), and inequivalences in I/O behavior (7 tasks), e.g.,
values written via printf, which our implementation does not support yet.¹⁶


16 Support for I/O can be added by writing all outputs to the checkpoint. 


[Figure 4: two scatter plots with logarithmic axes. Left: test execution time with 1 thread, 1 run [s] (x-axis) against test execution time with multiple runs [s] (y-axis). Right: total time with 1 thread, 1 run [s] (x-axis) against total time with multiple runs [s] (y-axis). Each plot shows one series for 10 runs and one for 100 runs.]

Fig. 4: Per-task comparison of the execution time of all test program runs (left) and the
total runtime of PEQTEST (right) in the low effort configuration (1 thread, 1 run) against
the other two configurations (2 threads for DRB tasks and 1 thread for sequential
tasks, and 10 or 100 runs)


Second, we examine the costs for increasing PEQTEST’s resources for test 
program execution. To this end, we look at the execution times PEQTEST 
consumes for all test program runs and the total execution time (test program 
generation and execution). For each task that does not belong to one of the
"task not completed" categories, Figure 4 compares the times for the low-effort
configuration (1 thread, 1 run, x-axis) with the other two configurations of PEQTEST. As one
could have expected, the scatter plot on the left-hand side of Fig. 4 shows that 
the execution times for the test programs scale linearly with the number of runs. 
A similar behavior can often be observed when the total time of the low effort 
configuration is not dominated by the test program generation (> 3s). 

In summary, providing more resources often increases PEQTEST’s effective- 
ness while causing at most a linear increase of runtime costs. In particular for 
parallelized tasks, using more than one thread is beneficial. However, we re- 
quire many runs of the generated test program to find schedule-dependent or 
input-sensitive inequivalences. 

RQ 3: How does PEQTEST compare against the state of the art? We
compare PEQTEST’s configuration using 100 runs with the equivalence checker
PEQCHECK [19], which also performs localized checks, but relies on verification. Since
PEQTEST’s and PEQCHECK’s definitions of functional equivalence differ (PEQTEST
considers all variables, while PEQCHECK only considers live variables), we
restrict the comparison of PEQTEST and PEQCHECK to those 72 DRB tasks
and 8 sequential tasks that (1) are either equivalent or inequivalent for both
notions of equivalence and (2) in which the code segments affect at least one
variable that is live afterwards.

Table 2 shows the results of PEQTEST and PEQCHECK on the restricted
benchmark. The structure of Tab. 2 is the same as that of Tab. 1. Looking at Tab. 2, we
first observe that neither approach incorrectly detects an inequivalence, i.e.,
neither reports false alarms. Hence, the precision for inequivalence detection is
Table 2: For PEQTEST and PEQCHECK, the absolute and relative number of DRB and
sequential (SEQ) tasks for which an inequivalence is detected correctly, is missed, is
detected incorrectly, and is correctly not detected. If no inequivalence is detected,
the table also distinguishes between "no inequivalence reported" and "task not
completed" (due to a timeout or failure).


                                      PEQTEST (100 runs)       PEQCHECK
                                      DRB         SEQ          DRB         SEQ
correctly detected inequivalence      39 (74%)    2 (67%)      3 (6%)      3 (100%)
missed inequivalence                  14 (26%)    1 (33%)      50 (94%)    0 (0%)
  no inequivalence reported           11 (21%)    0 (0%)       1 (2%)      0 (0%)
  task not completed                  3 (5%)      1 (33%)      49 (92%)    0 (0%)
incorrectly detected inequivalence    0 (0%)      0 (0%)       0 (0%)      0 (0%)
correctly detected no inequivalence   19 (100%)   5 (100%)     19 (100%)   5 (100%)
  no inequivalence reported           15 (79%)    4 (80%)      5 (26%)     4 (80%)
  task not completed                  4 (21%)     1 (20%)      14 (74%)    1 (20%)


100 %. For the sequential tasks, PEQCHECK detects one additional inequivalent
task, for which PEQTEST times out. In contrast, PEQTEST detects significantly
more inequivalent DRB tasks (i.e., has a higher recall) and, thus, misses fewer
inequivalent DRB tasks. An important reason for the lower recall of PEQCHECK
is that PEQCHECK’s inspection fails in 87.5% of the DRB tasks. The major
failure causes are timeouts (30%), missing support for OpenMP constructs in
the verifier CIVL (31%), and the detection of violations that are unrelated to
functional equivalence, e.g., out-of-bounds array accesses in a verification task
that PEQCHECK generates to check functional equivalence. Despite PEQCHECK’s
worse performance, it can verify the task DRB076-flush-orig-no.c,
for which PEQTEST failed. Finally, we remark that although PEQTEST has a
higher time limit than PEQCHECK (namely, 5 min per run instead of 5 min per
verification task), there exist only two tasks in which PEQTEST requires more
than 5 min in total and PEQCHECK could have profited from a higher time limit.

Summing up, PEQTEST is typically a better choice than PEQCHECK when 
aiming to find inequivalences. In particular, PEQTEST profits from relying on 
compiler support of OpenMP constructs and from checking equivalence only for 
the test inputs. Thus, PEQTEST is well-suited for inequivalence detection, but in 
contrast to PEQCHECK, which considers all inputs, it rarely proves equivalence. 
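
The precision and recall figures discussed for the DRB tasks can be recomputed from the counts in Tab. 2. The following is a small sketch, not part of the evaluation tooling; the function name `precision_recall` is hypothetical, and the counts assume that the restricted benchmark contains 53 inequivalent DRB tasks (39 detected and 14 missed by PEQTEST, 3 detected and 50 missed by PEQCHECK), as implied by the reported percentages.

```python
from fractions import Fraction


def precision_recall(detected, missed, false_alarms):
    """Return (precision, recall) for inequivalence detection.

    detected     -- inequivalent tasks reported as inequivalent
    missed       -- inequivalent tasks not reported as inequivalent
    false_alarms -- equivalent tasks reported as inequivalent
    """
    recall = Fraction(detected, detected + missed)
    reported = detected + false_alarms
    # Precision is undefined if the tool never reports an inequivalence.
    precision = Fraction(detected, reported) if reported else None
    return precision, recall


# Assumed DRB counts (see lead-in); neither tool raised a false alarm.
peqtest_precision, peqtest_recall = precision_recall(39, 14, 0)
peqcheck_precision, peqcheck_recall = precision_recall(3, 50, 0)

print(f"PEQTEST  recall: {float(peqtest_recall):.0%}")
print(f"PEQCHECK recall: {float(peqcheck_recall):.0%}")
```

Exact fractions are used so that rounding happens only when the percentages are printed.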


5 Related Work 


Approaches inspecting functional equivalence aim at proving equivalence or 
detecting behavioral differences. Alternatively, they characterize for which inputs 
equivalence is ensured. 
Proving Functional Equivalence. Approaches proving functional equivalence
may use relational verification [6,5], (bi)simulation relations [56,40,14,13], or
domain-specific checks [55,9,10,18,25]. Other approaches transform the programs
into models and check model equivalence [43,49,41]. ARDiff [4] compares symbolic
summaries and Rêve [15] translates the equivalence into Horn constraints.
Several approaches [17,38,23,51,42,2,19] encode equivalence checking into programs.
Their encoding idea is similar to PEQTEST’s encoding of the functional
equivalence tests. The closest encoding is the encoding of UC-KLEE [38], which
also uses checkpointing, while the other approaches duplicate variables. Despite
similar encodings, these approaches do not test, but verify the generated programs.
A further difference is that they typically generate more than one program,
namely one per changed unit (program [42], function [17,38,23,51], or refactored
code segment [2,19]). Each generated program only consists of the functional
equivalence check of the respective unit and typically considers all possible inputs.
In contrast, PEQTEST embeds the equivalence tests into the original program
and only considers inputs produced by the original program.


Difference Detection. Relative debugging [3] executes the original and
refactored program in parallel and compares the values of user-defined variables
or data structures at user-defined program locations, which is more fine-grained
than functional equivalence. Nevertheless, several techniques focus on detecting
differences of the functional behavior. Differential monitoring [28] applies runtime 
monitoring that runs two programs, e.g., original and refactored program, in 
parallel, distributes any input to both programs, compares their outputs, and 
forwards equivalent outputs to the environment, while aborting in case of inequiv- 
alence. Following the idea of differential testing [26], BERT [20], shadow symbolic 
execution [31], and HyDiff [29] generate tests and execute the generated tests on 
original and refactored program to detect differences in the behavior. BERT [20] 
generates inputs to cover the changed code parts. Shadow symbolic execution [31] 
uses a more advanced test generation that is steered towards internal behavioral 
differences. HyDiff [29] combines shadow symbolic execution with fuzzing, using 
the tests from the shadow symbolic execution to steer the fuzzer AFL. In contrast,
Qi et al. [36] and eXpress [47] directly aim at generating difference-revealing
tests. To this end, they steer the test generation to find test inputs that reach a
change that affects the output. While the previous techniques use special test
generators, Diffut [52] and DiffGen [46] rely on standard test generators. Diffut 
keeps shadow variables for the original program in the refactored program, wraps 
the method of the original program to extend it with equivalence checks, and uses 
JML annotations to force the execution of the wrapped method of the original 
program while testing the refactored program. DiffGen [46] generates one test 
driver per changed method that copies the input, executes original and refactored 
method with original and copied input, respectively, and contains one check per 
output. DiffGen’s encoding idea is similar to PEQTEST’s encoding of functional 
equivalence tests, but PEQTEST focuses on refactored code segments. 


Semantic Characterization of Differences. To provide more information
in case of non-equivalence, a few approaches compute or (under)approximate the
condition under which the original and refactored program are equivalent. To this end, they
use symbolic execution [34,48], abstract interpretation [21,32,33], or testing [11].


6 Conclusion 


While refactorings are necessary to improve software quality, correct refactoring,
i.e., a refactoring that does not change the functional behavior of the software,
is challenging. Several solutions have been proposed to detect that refactored
programs alter the behavior; some of them compare the functional behavior of
original and refactored programs.

Approaches checking functional equivalence often use heavyweight (formal)
verification. Furthermore, difference detection approaches frequently use dedicated
test generators and execute (some of) the non-modified code twice, once for
the original and once for the refactored program (function). To overcome these
restrictions, we propose PEQTEST, which can be used to test (the intended
application) or to formally verify functional equivalence. The test program generated
by PEQTEST, for which we proved that it checks functional equivalence, allows
us to rely on compiler support, e.g., for OpenMP, to reuse existing tests or test
generators, and at the same time to exploit that refactorings are often local, thus
avoiding executing non-modified code more than once in each test program
execution. To this end, PEQTEST replaces each refactored code segment in the
program, e.g., a parallelized code segment, by a local check that inspects the
equivalence of the corresponding original and refactored code segment.
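
The idea of such a local check can be illustrated with a minimal sketch. It is written in Python for brevity, whereas PEQTEST itself targets C programs, and it is not the encoding used by the tool: the program state is modelled as a dictionary, and the names `check_segment`, `sum_original`, `sum_refactored_ok`, and `sum_refactored_bad` are hypothetical. The sketch takes a checkpoint of the state, runs the original segment on the checkpoint and the refactored segment on the live state, and compares all variables afterwards.

```python
import copy


def check_segment(state, original, refactored):
    """Run both versions of a code segment from the same state and compare.

    `state` is a dict from variable names to values; both segments mutate
    the dict they are given. Returns the state after the refactored segment.
    """
    checkpoint = copy.deepcopy(state)  # snapshot taken before the segment
    original(checkpoint)               # original segment works on the copy
    refactored(state)                  # refactored segment works on live state
    if state != checkpoint:
        names = state.keys() | checkpoint.keys()
        diff = sorted(n for n in names if state.get(n) != checkpoint.get(n))
        raise AssertionError(f"inequivalence in variables: {diff}")
    return state


def sum_original(s):
    s["total"] = 0
    for x in s["data"]:
        s["total"] += x


def sum_refactored_ok(s):
    s["total"] = sum(s["data"])


def sum_refactored_bad(s):
    s["total"] = sum(s["data"][1:])  # bug: skips the first element


result = check_segment({"data": [1, 2, 3], "total": -1},
                       sum_original, sum_refactored_ok)
print(result["total"])
```

Only the segment is executed twice; the surrounding program continues once from the returned state, which mirrors the locality argument made above.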

We implemented PEQTEST and evaluated it with the DataRaceBench bench- 
mark suite and sequential refactorings already used to evaluate other functional 
equivalence checkers. Our experiments show that PEQTEST detects many of the 
inequivalent tasks, e.g., incorrectly parallelized tasks, using a limited amount of 
resources, while reporting no false alarm. A comparison with the state-of-the-art 
equivalence checker PEQCHECK reveals that PEQTEST often performs better. 
Abstract. We present a new approach on how to provide institution-based semantics
for communicating UML state machines in form of a hybrid modal logic $\mathcal{M}^{\downarrow}_{\mathcal{D}}$.
A theoroidal comorphism maps $\mathcal{M}^{\downarrow}_{\mathcal{D}}$ into the CASL institution. This allows for
symbolic reasoning on communicating UML state machines.


1 Introduction 


In line with a long-standing line of research [5,6,15,4], we set out on a general programme 
to bring together multi-view system specification with UML diagrams and heterogeneous 
specification and verification based on institution theory, giving the different system 
views both a joint semantics and richer tool support. Institutions, a formal notion of 
a logic, are a principled way of creating such joint semantics. They make moderate 
assumptions about the data constituting a logic, give uniform notions of well-behaved 
translations between logics and, given a graph of such translations, automatically give 
rise to a joint institution. 

UML state machines are an object-based variant of Harel statecharts. Within the 
UML, state machines are a central means to specify system behaviour. In previous work 
[16], an institutional semantics for UML state machines was provided that allowed for 
symbolic reasoning. Such symbolic reasoning can be of advantage as, in principle, it
allows one to verify properties of UML state machines with large or infinite state spaces.
Here, we extend this work in order to cater for communication. 

A typical scenario for such communication is the interaction between a User, an 
ATM, and a Bank in order to authenticate the User as legitimate owner of a bank card by 
checking an entered PIN. Figure 1 depicts a UML modelling for this scenario. In brief,
the system consists of the ATM and the Bank, where we consider User interaction as an 
external communication. The scenario begins with the User entering a bank card and a 
PIN. The ATM requests their verification by the Bank. The Bank checks validity of the 
card/PIN combination and communicates the result to the ATM. We model the validity 
check as internal, non-deterministic choice made by the Bank. In case of a positive result, 
the ATM will return the card to the User. In case of a negative result, the User is given a 
[Figure 1 (diagrams not reproduced). Composite structure diagram: external User,
parts atm : ATM and bank : Bank, ports userCom, bankCom, and atmCom.
Labels recoverable from the ATM state machine: userCom.card(c)/…; userCom.PIN(p)/…;
state PINEntered; /bankCom.verify(cardId, pin); bankCom.verified/; state VeriSuccess;
/userCom.ejectCard; trialsNum = 0; bankCom.reenterPIN [trialsNum < 3] /
trialsNum = trialsNum+1; bankCom.reenterPIN [trialsNum >= 3] / userCom.keepCard();
trialsNum = 0.
Labels recoverable from the Bank state machine: atmCom.verify / wasVerified = 0;
/ wasVerified = 1; / atmCom.verified; / reenterPIN.]


Figure 1. UML diagrams for the ATM example (implicit completion events omitted): Composite 
structure diagram: top; state machine: left ATM, right Bank. 


second and third chance to enter a correct PIN. After the third verification failure, the 
ATM will keep the card. A typical question on this model is whether the ATM will consider 
the verification successful only if the Bank has already come to the same conclusion. To 
answer such questions, one needs to take into account the behaviour of all state machines 
involved as well as how they can communicate via the ports and connectors as specified 
by a composite structure diagram. 

Closest to our approach are the works [6,4]. Both these papers address the topic 
of communicating state machines, however, both fail to provide institutions of state 
machines as reported in [15,16]. Learning from the reason for this shortcoming, rather 
than capturing UML state machines directly as an institution, [16] builds up a new 
logic in which UML state machines can be embedded. Here, we extend this logic for 
communication. In particular, we treat UML event pools as part of composite structure 
diagrams rather than of state machines. State machines are seen as a completely open 
system, which is (partially) closed by ‘wiring up’ in a communication structure. Overall, 
this leads to a separation of concerns: event pools and transitions can be analysed 
independently. 

A number of authors give formal semantics to communicating state machines, however 
with a purpose different from symbolic analysis of UML. The Object Management Group 
provides an executable semantics of UML Composite Structures [14]. Their objective is to 
provide an interpreter for the executable subset fUML of the UML. Dragomir [12] define
transformations from composite structure diagrams to communicating extended timed 
automata for the purpose of simulation, static analysis and model-checking. Mazzanti et 
al. [8] provide a UML model checker that also covers composite structure diagrams. A 
quite comprehensive formal semantics has been provided by Liu et al. [7], again with the 
main purpose of supporting model checking. 

In Section 2, we recall the notion of an institution and sketch the $\mathrm{CFOL}^{=}$ institution
of CASL, which we use for specifying data. In Section 3, we extend the hybrid modal
logic $\mathcal{M}^{\downarrow}_{\mathcal{D}}$ [16] to cater also for output by adding the notion of messages (in [16] with
input only). For structures and formulae this requires us to introduce relativisations with
regard to a set of outputs. We show that the extended logic $\mathcal{M}^{\downarrow}_{\mathcal{D}}$ is an institution, can
be embedded into CASL via a theoroidal comorphism, and allows for “borrowing” of
CASL theorem proving support. In Section 4, we show how to embed simple UML state
machines with output into the extended logic $\mathcal{M}^{\downarrow}_{\mathcal{D}}$. In Section 5, we provide an institution
for simple UML composite structures by enriching our extended logic $\mathcal{M}^{\downarrow}_{\mathcal{D}}$ with elements
capturing connectors and event queues. Again, “borrowing” of CASL theorem proving
support is possible. Finally, in Section 6, we demonstrate that our approach allows for
automated theorem proving.


2 Background on Institutions and CASL


Institutions are an abstract formalisation of the notion of logical systems combining
signatures, structures, sentences, and satisfaction under the slogan “truth is invariant under
change of notation” [3]. Institutions can be related in different ways by institution (forward)
(co-)morphisms, where a so-called theoroidal institution comorphism covers a particular
case of encoding a “poorer” logic into a “richer” one. The algebraic specification
language CASL [11] uses an institution of first-order logic at its basic specification
level, where mainly signature items and axiom sentences are listed. On its structured
specifications level, CASL offers institution-independent combination mechanisms to
build larger specifications in a hierarchical and modular fashion. We use CASL’s basic
institution $\mathrm{CFOL}^{=}$ of first-order logic with equality and sort generation constraints [9]
and construct a theoroidal institution comorphism from our hybrid modal logic institution
$\mathcal{M}^{\downarrow}_{\mathcal{D}}$ to $\mathrm{CFOL}^{=}$.


2.1 Institutions and Theoroidal Institution Comorphisms 


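An institution $\mathcal{I} = (\mathbb{S}^{\mathcal{I}}, \mathrm{Str}^{\mathcal{I}}, \mathrm{Sen}^{\mathcal{I}}, \models^{\mathcal{I}})$ consists of (i) a category of signatures $\mathbb{S}^{\mathcal{I}}$;
(ii) a contravariant structures functor $\mathrm{Str}^{\mathcal{I}} : (\mathbb{S}^{\mathcal{I}})^{\mathrm{op}} \to \mathrm{Cat}$, where Cat is the category
of (small) categories; (iii) a sentence functor $\mathrm{Sen}^{\mathcal{I}} : \mathbb{S}^{\mathcal{I}} \to \mathrm{Set}$, where Set is the category
of sets; and (iv) a family of satisfaction relations $\models^{\mathcal{I}}_{\Sigma} \subseteq |\mathrm{Str}^{\mathcal{I}}(\Sigma)| \times \mathrm{Sen}^{\mathcal{I}}(\Sigma)$ indexed
over $\Sigma \in |\mathbb{S}^{\mathcal{I}}|$, such that the following satisfaction condition holds for all $\sigma : \Sigma \to \Sigma'$ in
$\mathbb{S}^{\mathcal{I}}$, $\varphi \in \mathrm{Sen}^{\mathcal{I}}(\Sigma)$, and $M' \in |\mathrm{Str}^{\mathcal{I}}(\Sigma')|$:


$$\mathrm{Str}^{\mathcal{I}}(\sigma)(M') \models^{\mathcal{I}}_{\Sigma} \varphi \iff M' \models^{\mathcal{I}}_{\Sigma'} \mathrm{Sen}^{\mathcal{I}}(\sigma)(\varphi) .$$


$\mathrm{Str}^{\mathcal{I}}(\sigma)$ is called the reduct functor, $\mathrm{Sen}^{\mathcal{I}}(\sigma)$ the translation function.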
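
As an illustration of the satisfaction condition, the following sketch checks it by brute force for a toy instance: propositional logic, where signatures are sets of proposition names, signature morphisms are renamings, structures are truth assignments, and the reduct evaluates each source name through the renaming. This toy institution and all names in the code (`translate`, `reduct`, `holds`, `satisfaction_condition`) are assumptions made for the example; they are not taken from CASL or from the logic developed in this paper.

```python
from itertools import product

# Sentences are nested tuples: ("var", p), ("not", f), ("or", f, g).


def translate(sigma, phi):
    """Sentence translation: rename proposition names along sigma."""
    tag = phi[0]
    if tag == "var":
        return ("var", sigma[phi[1]])
    if tag == "not":
        return ("not", translate(sigma, phi[1]))
    return ("or", translate(sigma, phi[1]), translate(sigma, phi[2]))


def reduct(sigma, model):
    """Reduct: a model of the target signature read as a source model."""
    return {p: model[q] for p, q in sigma.items()}


def holds(model, phi):
    """Satisfaction of a sentence in a truth assignment."""
    tag = phi[0]
    if tag == "var":
        return model[phi[1]]
    if tag == "not":
        return not holds(model, phi[1])
    return holds(model, phi[1]) or holds(model, phi[2])


def satisfaction_condition(sigma, target_signature, sentences):
    """Check reduct(M') |= phi  iff  M' |= translate(phi) for all M'."""
    names = sorted(target_signature)
    for values in product([False, True], repeat=len(names)):
        m_prime = dict(zip(names, values))
        for phi in sentences:
            lhs = holds(reduct(sigma, m_prime), phi)
            rhs = holds(m_prime, translate(sigma, phi))
            if lhs != rhs:
                return False
    return True


# A non-injective renaming from signature {p, q} to signature {a, b}.
sigma = {"p": "a", "q": "a"}
sentences = [
    ("var", "p"),
    ("or", ("var", "p"), ("not", ("var", "q"))),
    ("not", ("or", ("var", "p"), ("var", "q"))),
]
print(satisfaction_condition(sigma, {"a", "b"}, sentences))
```

The check enumerates all structures of the target signature, so it is only feasible for small signatures; it is meant to make the direction of reduct (backwards) and translation (forwards) concrete.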

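A theory presentation $T = (\Sigma, \Phi)$ in the institution $\mathcal{I}$ consists of a signature
$\Sigma \in |\mathbb{S}^{\mathcal{I}}|$, also denoted by $\mathrm{Sig}(T)$, and a set of sentences $\Phi \subseteq \mathrm{Sen}^{\mathcal{I}}(\Sigma)$. Its model
class $\mathrm{Mod}^{\mathcal{I}}(T)$ is the class $\{M \in |\mathrm{Str}^{\mathcal{I}}(\Sigma)| \mid M \models^{\mathcal{I}}_{\Sigma} \varphi \text{ for all } \varphi \in \Phi\}$ of the $\Sigma$-structures
satisfying the sentences in $\Phi$. A theory presentation morphism $\sigma : (\Sigma, \Phi) \to (\Sigma', \Phi')$
is given by a signature morphism $\sigma : \Sigma \to \Sigma'$ such that $M' \models^{\mathcal{I}}_{\Sigma'} \mathrm{Sen}^{\mathcal{I}}(\sigma)(\varphi)$ for all
$\varphi \in \Phi$ and $M' \in \mathrm{Mod}^{\mathcal{I}}(\Sigma', \Phi')$. Theory presentations in $\mathcal{I}$ and their morphisms form
the category $\mathrm{Pres}^{\mathcal{I}}$.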

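A theoroidal institution comorphism $\nu = (\nu^{Sig}, \nu^{Mod}, \nu^{Sen}) : \mathcal{I} \to \mathcal{I}'$ consists
of a functor $\nu^{Sig} : \mathbb{S}^{\mathcal{I}} \to \mathrm{Pres}^{\mathcal{I}'}$ inducing the functor $\nu^{\mathbb{S}} = \nu^{Sig}; \mathrm{Sig} : \mathbb{S}^{\mathcal{I}} \to \mathbb{S}^{\mathcal{I}'}$ on
signatures, a natural transformation $\nu^{Mod} : (\nu^{Sig})^{\mathrm{op}}; \mathrm{Mod}^{\mathcal{I}'} \to \mathrm{Str}^{\mathcal{I}}$ on models and
structures, and a natural transformation $\nu^{Sen} : \mathrm{Sen}^{\mathcal{I}} \to \nu^{\mathbb{S}}; \mathrm{Sen}^{\mathcal{I}'}$ on sentences, such that
for all $\Sigma \in |\mathbb{S}^{\mathcal{I}}|$, $M' \in |\mathrm{Mod}^{\mathcal{I}'}(\nu^{Sig}(\Sigma))|$, and $\varphi \in \mathrm{Sen}^{\mathcal{I}}(\Sigma)$ the following satisfaction
condition holds:


$$\nu^{Mod}_{\Sigma}(M') \models^{\mathcal{I}}_{\Sigma} \varphi \iff M' \models^{\mathcal{I}'}_{\nu^{\mathbb{S}}(\Sigma)} \nu^{Sen}_{\Sigma}(\varphi) .$$


A theory presentation $(\Sigma, \Phi)$ over the institution $\mathcal{I}$ is translated via a theoroidal institution
comorphism $\nu : \mathcal{I} \to \mathcal{I}'$ into the theory presentation $\nu^{Pres}(\Sigma, \Phi) = (\Sigma_{\nu}, \Phi_{\nu} \cup \nu^{Sen}_{\Sigma}(\Phi))$
over $\mathcal{I}'$ where $\nu^{Sig}(\Sigma) = (\Sigma_{\nu}, \Phi_{\nu})$ and $\nu^{Sen}_{\Sigma}(\Phi) = \{\nu^{Sen}_{\Sigma}(\varphi) \mid \varphi \in \Phi\}$.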


2.2 CASL and the Institution $\mathrm{CFOL}^{=}$


At the level of basic CASL specifications, $\mathrm{CFOL}^{=}$ offers declarations of sorts, operations,
and predicates with given argument and result sorts. Formally, this defines a many-sorted
signature $\Sigma = (S, F, P)$ with a set $S$ of sorts, an $S^* \times S$-sorted family $F = (F_{w,s})_{ws \in S^+}$
of total function symbols, and a family $P = (P_w)_{w \in S^*}$ of predicate symbols. Using these
symbols, one may then write axioms in first-order logic with equality. Moreover, one can
specify data types, each given by a list of data constructors and, optionally, selectors.
Data types may be declared to be generated or free. Generatedness amounts to an implicit
higher-order induction axiom and intuitively states that all elements of the data types are
reachable by constructor terms (“no junk”); freeness additionally requires that all these
constructor terms are distinct (“no confusion”). Basic CASL specifications denote the
class of all algebras which fulfil the declared axioms, i.e., CASL has loose semantics. More
formally, for $\mathrm{CFOL}^{=}$ a many-sorted $\Sigma$-structure $M$ consists of a non-empty carrier set
$s^M$ for each $s \in S$, a total function $f^M : w^M \to s^M$ for each function symbol $f \in F_{w,s}$
and a predicate $p^M$ for each predicate symbol $p \in P_w$. A many-sorted $\Sigma$-sentence is a
closed many-sorted first-order formula over $\Sigma$ or a sort generation constraint.


3 The Hybrid Modal Logic $\mathcal{M}^{\downarrow}_{\mathcal{D}}$ for Event/Data Systems


The logic $\mathcal{M}^{\downarrow}_{\mathcal{D}}$ is a hybrid modal logic for specifying and reasoning about event/data-based
reactive systems. The modal part of the logic allows handling transitions between
system configurations, where the modalities describe guarded configuration moves based
on input and output events with arguments, i.e., messages, and the corresponding effects
on data. The hybrid part of the logic allows binding control states of system configurations
and jumping to configurations with such control states explicitly. $\mathcal{M}^{\downarrow}_{\mathcal{D}}$ with its signatures,
sentences, and structures forms an institution. Furthermore, $\mathcal{M}^{\downarrow}_{\mathcal{D}}$ can be translated into
CASL via a theoroidal institution comorphism.

We extend the logic and the comorphism of [16] by including output. A modal formula
$\langle i : \phi /\!\!/ [O]_N : \psi \rangle\!\rangle \varrho$ now says that in the current configuration an input message according
to $i$ can be accepted if the precondition state predicate $\phi$ holds and that, in response, output
messages according to $[O]_N$ and satisfying the transition predicate $\psi$ can be produced
such that $\varrho$ holds afterwards. The messages frame $[O]_N$ tells that besides outputs from
$O$ also additional messages according to $N$ can be sent. This relativisation allows $\mathcal{M}^{\downarrow}_{\mathcal{D}}$
to specify the “cone of messages above $O$” in a finite and, in particular,
institution-compatible way that also is extensible into a theoroidal institution comorphism from
$\mathcal{M}^{\downarrow}_{\mathcal{D}}$ to CASL. We furthermore demonstrate that for pure $\mathcal{M}^{\downarrow}_{\mathcal{D}}$-invariants the comorphism
leads to simpler CASL proof obligations that are easier to automate in theorem proving.

For the inclusion of data in $\mathcal{M}^{\downarrow}_{\mathcal{D}}$, we assume given a consistent, monomorphic CASL
specification $Dt$. The interpretation of the sorts $S(Dt)$ of $Dt$ represents the different
kinds of data, like the integers or lists of integers. Requiring $Dt$ to be monomorphic
fixes these carrier sets as there is, up to isomorphism, a single model $\mathcal{D}$ of $Dt$. We
also use open formulae $\mathscr{F}_{\Sigma(Dt),X}$ over sorted variables $X = (X_s)_{s \in S(Dt)}$ and their
satisfaction relation $\mathcal{D}, \beta \models_{\Sigma(Dt),X} \varphi$ for a variable valuation $\beta : X \to \mathcal{D}$, i.e.,
$\beta = (\beta_s : X_s \to s^{\mathcal{D}})_{s \in S(Dt)}$.


3.1 Data States and Transitions 


A data signature $A$ consists of a finite set of attributes $|A|$ and a sorting $s(A) : |A| \to
S(Dt)$. A data signature morphism from a data signature $A$ to a data signature $A'$ is a
function $\alpha : |A| \to |A'|$ such that $s(A)(a) = s(A')(\alpha(a))$ for all $a \in |A|$. We sometimes
identify $A$ with the $S(Dt)$-sorted family $(s(A)^{-1}(s))_{s \in S(Dt)}$.

A data state $\omega$ for a data signature $A$ is given by an attribute valuation $\omega : A \to \mathcal{D}$,
i.e., $\omega(a) \in s(A)(a)^{\mathcal{D}}$ for $a \in |A|$; in particular, $\Omega(A) = \mathcal{D}^A$ is the set of $A$-data states.


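The state predicates $\mathscr{F}^{\mathcal{D}}_{A,X}$ are the formulae in $\mathscr{F}_{\Sigma(Dt),A \cup X}$, taking $A$ as well as an
additional $S(Dt)$-indexed family $X$ as variables. A state predicate $\phi \in \mathscr{F}^{\mathcal{D}}_{A,X}$ is to be
interpreted over an $A$-data state $\omega$ and variable valuation $\beta : X \to \mathcal{D}$ and we define the
satisfaction relation $\models^{\mathcal{D}}$ by


$$\omega, \beta \models^{\mathcal{D}}_{A,X} \phi \iff \mathcal{D}, \omega \cup \beta \models_{\Sigma(Dt),A \cup X} \phi .$$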


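The $\alpha$-reduct of an $A'$-data state $\omega' : A' \to \mathcal{D}$ along a data signature morphism
$\alpha : A \to A'$ is given by the $A$-data state $\omega'|\alpha : A \to \mathcal{D}$ with $(\omega'|\alpha)(a) = \omega'(\alpha(a))$ for
every $a \in |A|$. The state predicate translation $\mathscr{F}^{\mathcal{D}}_{\alpha,X} : \mathscr{F}^{\mathcal{D}}_{A,X} \to \mathscr{F}^{\mathcal{D}}_{A',X}$ along $\alpha : A \to
A'$ is given by the CASL-formula translation $\mathscr{F}_{\Sigma(Dt),\alpha \cup 1_X}$ along the substitution $\alpha \cup 1_X$.
Reduct and translation fulfil the following satisfaction condition due to the general
substitution lemma for CASL:


$$\omega'|\alpha, \beta \models^{\mathcal{D}}_{A,X} \phi \iff \omega', \beta \models^{\mathcal{D}}_{A',X} \mathscr{F}^{\mathcal{D}}_{\alpha,X}(\phi) .$$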


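A data transition $(\omega, \omega')$ for a data signature $A$ is a pair of $A$-data states; in particular,
$\Omega^2(A) = (\mathcal{D}^A)^2$ is the set of $A$-data transitions. It holds that $(\mathcal{D}^A)^2 \cong \mathcal{D}^{2A}$, where
$2A = A \uplus A$ and we assume that no attribute in $A$ ends in a prime $'$ and all attributes in
the second summand are adorned with an additional prime. The transition predicates
$\mathscr{F}^{2\mathcal{D}}_{A,X}$ are the formulae $\mathscr{F}^{\mathcal{D}}_{2A,X}$. The satisfaction relation $\models^{2\mathcal{D}}$ for a transition predicate
$\psi \in \mathscr{F}^{2\mathcal{D}}_{A,X}$, data transition $(\omega, \omega') \in \Omega^2(A)$, and valuation $\beta : X \to \mathcal{D}$ is defined as


$$(\omega, \omega'), \beta \models^{2\mathcal{D}}_{A,X} \psi \iff \omega + \omega', \beta \models^{\mathcal{D}}_{2A,X} \psi$$


where $\omega + \omega' \in \Omega(2A)$ with $(\omega + \omega')(a) = \omega(a)$ and $(\omega + \omega')(a') = \omega'(a)$.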
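The $\alpha$-reduct of an $A'$-data transition $(\omega', \omega'')$ along a data signature morphism
$\alpha : A \to A'$ is given by the $A$-data transition $(\omega', \omega'')|\alpha = (\omega'|\alpha, \omega''|\alpha)$. The transition
predicate translation $\mathscr{F}^{2\mathcal{D}}_{\alpha,X}$ along $\alpha$ is given by $\mathscr{F}^{\mathcal{D}}_{2\alpha,X}$ with $2\alpha : 2A \to 2A'$ defined by
$2\alpha(a) = \alpha(a)$ and $2\alpha(a') = \alpha(a)'$. Like for data states, reduct and translation fulfil the
following satisfaction condition:


$$(\omega', \omega'')|\alpha, \beta \models^{2\mathcal{D}}_{A,X} \psi \iff (\omega', \omega''), \beta \models^{2\mathcal{D}}_{A',X} \mathscr{F}^{2\mathcal{D}}_{\alpha,X}(\psi) .$$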


3.2 Events and Messages 


An event signature $E$ consists of a finite set of events $|E|$ and a map $\bar{s}(E) : |E| \to S(Dt)^*$
assigning to each $e \in |E|$ its list of parameter sorts. An event signature morphism
$\eta : E \to E'$ is a function $\eta : |E| \to |E'|$ such that $\bar{s}(E)(e) = \bar{s}(E')(\eta(e))$ for all
$e \in |E|$. We write $e(X)$ for $e \in |E|$ and $\bar{s}(E)(e) = s_1, \ldots, s_n$ when choosing $n$ different
parameters $X = x_1, \ldots, x_n$, and also $e(X) \in E$ in this case; when $f = e(X)$, we write
$X(f)$ for $X$ and we furthermore lift this notation to sets and lists of events. We sometimes
identify the parameter list $X$ with the $S(Dt)$-sorted family $(\{x_i \mid s_i = s\})_{s \in S(Dt)}$ and
write $\bar{s}(E)(e)(x_i)$ for $s_i$.

A message $e(\beta)$ over an event signature $E$ is given by an event $e(X) \in E$ with its
parameters $X$ instantiated by a parameter valuation $\beta : X \to \mathcal{D}$ such that $\beta(x) \in s^{\mathcal{D}}$ for
$\bar{s}(E)(e)(x) = s$; the set of all messages over an event signature $E$ is denoted by $\mathrm{E}(E)$.
When $\hat{e} = e(\beta) \in \mathrm{E}(E)$, we write $\beta(\hat{e})$ for $\beta$, and when $e(X) \in E$ and $\beta : Y \to \mathcal{D}$ for
$X \subseteq Y$, we write $e(\beta)$ for $e(\beta|X)$; both notations are furthermore lifted to sets and lists.

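The set of shufflings $\hat{F}_1 \parallel \hat{F}_2$ of two message lists $\hat{F}_1$ and $\hat{F}_2$ is inductively given by


$$\hat{F} \parallel \varepsilon = \{\hat{F}\} = \varepsilon \parallel \hat{F} ,$$
$$(\hat{f} :: \hat{F}_1) \parallel \hat{F}_2 \supseteq \{\hat{f} :: \hat{F} \mid \hat{F} \in \hat{F}_1 \parallel \hat{F}_2\} \subseteq \hat{F}_2 \parallel (\hat{f} :: \hat{F}_1) .$$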
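
Read operationally, the shufflings of two message lists are all their interleavings that preserve the relative order within each list. A minimal sketch of this reading follows; the function name `shufflings` and the representation of message lists as Python lists (with results as tuples) are assumptions made for the example.

```python
def shufflings(xs, ys):
    """Return the set of all interleavings of the lists xs and ys.

    Each result is a tuple that contains the elements of xs and ys and
    keeps the internal order of xs and of ys.
    """
    if not xs:
        return {tuple(ys)}
    if not ys:
        return {tuple(xs)}
    # Either the head of xs or the head of ys comes first.
    head_from_xs = {(xs[0],) + rest for rest in shufflings(xs[1:], ys)}
    head_from_ys = {(ys[0],) + rest for rest in shufflings(xs, ys[1:])}
    return head_from_xs | head_from_ys


for interleaving in sorted(shufflings(["verified", "log"], ["ping"])):
    print(interleaving)
```

Here `log` and `ping` are hypothetical message names. The number of shufflings grows binomially in the list lengths, so the sketch is intended for small examples only.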


An event signature morphism $\eta : E \to E'$ is lifted to a message $e(\beta) \in \mathrm{E}(E)$ by
setting $\mathrm{E}(\eta)(e(\beta)) = \eta(e)(\beta) \in \mathrm{E}(E')$ and also to sets and lists of messages.


3.3 Event/Data Signatures 


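An event/data signature $\Sigma$ consists of input and output event signatures $I(\Sigma)$ and $O(\Sigma)$,
and a data signature $A(\Sigma)$. An event/data signature morphism $\sigma : \Sigma \to \Sigma'$ consists
of an input event signature morphism $I(\sigma) : I(\Sigma) \to I(\Sigma')$, an output event signature
morphism $O(\sigma) : O(\Sigma) \to O(\Sigma')$, and a data signature morphism $A(\sigma) : A(\Sigma) \to
A(\Sigma')$. We lift the event signatures and signature morphisms to messages by writing
$\hat{I}(\Sigma)$ for $\mathrm{E}(I(\Sigma))$, $\hat{O}(\Sigma)$ for $\mathrm{E}(O(\Sigma))$, $\hat{I}(\sigma)$ for $\mathrm{E}(I(\sigma))$, and $\hat{O}(\sigma)$ for $\mathrm{E}(O(\sigma))$.

The category of $\mathcal{M}^{\downarrow}_{\mathcal{D}}$-signatures $\mathbb{S}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}$ consists of the event/data signatures and
signature morphisms.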


3.4 Event/Data Structures 


A configuration $\gamma = (c, d)$ consists of a control state $c$ from some set of control states $C$
and a data state $d$ from some set of data states $D$. Given a data signature $A$ the data state
of $\gamma$ may be labelled by a map $\omega$ such that $\omega(d) \in \Omega(A)$. For a set of configurations $\Gamma$
we write $C(\Gamma)$ for its set of control states.

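A $\Sigma$-event/data structure $M = (\Gamma, R, \Gamma_0, \omega)$ over an event/data signature $\Sigma$ consists
of a set of configurations $\Gamma \subseteq C \times D$, a family of transition relations
$R = (R_{\hat{i},\hat{O}} \subseteq \Gamma \times \Gamma)_{\hat{i} \in \hat{I}(\Sigma), \hat{O} \in \hat{O}(\Sigma)^*}$, and a non-empty set of initial configurations $\Gamma_0 \subseteq \Gamma$ such that
$\Gamma$ is reachable from $\Gamma_0$ via $R$, i.e., for all $\gamma \in \Gamma$ there are $\gamma_0 \in \Gamma_0$, $n \geq 0$, $\hat{i}_1, \ldots,
\hat{i}_n \in \hat{I}(\Sigma)$, $\hat{O}_1, \ldots, \hat{O}_n \in \hat{O}(\Sigma)^*$, and $(\gamma_k, \gamma_{k+1}) \in R_{\hat{i}_{k+1},\hat{O}_{k+1}}$ for all $0 \leq k < n$ with
$\gamma_n = \gamma$; and a data state labelling $\omega : D \to \Omega(A(\Sigma))$.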

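We write $c(M)(\gamma) = c$ and $\omega(M)(\gamma) = \omega(d)$ for $\gamma = (c, d) \in \Gamma$, $\Gamma(M)$ for $\Gamma$,
$C(M)$ for $\{c(M)(\gamma) \mid \gamma \in \Gamma(M)\}$, $R(M)$ for $R$, $\Gamma_0(M)$ for $\Gamma_0$, $C_0(M)$ for $C(\Gamma_0)$,
and $\Omega_0(M)$ for $\{\omega(M)(\gamma_0) \mid \gamma_0 \in \Gamma_0\}$.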

The above definition restricts structures to reachable ones only. Although an
$\mathcal{M}^{\downarrow}_{\mathcal{D}}$-sentence will hold in an event/data structure if it is satisfied in all its initial states, the
modal and hybrid operators of $\mathcal{M}^{\downarrow}_{\mathcal{D}}$ will allow for expressing that a certain property
holds in all (reachable) states of the structure.

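The $\sigma$-reduct of a $\Sigma'$-event/data structure $M'$ along the event/data signature morphism
$\sigma : \Sigma \to \Sigma'$ is the $\Sigma$-event/data structure $M'|\sigma$ such that


- $\Gamma(M'|\sigma) \subseteq \Gamma(M')$ as well as $R(M'|\sigma) = (R(M'|\sigma)_{\hat{i},\hat{O}})_{\hat{i} \in \hat{I}(\Sigma), \hat{O} \in \hat{O}(\Sigma)^*}$ are
inductively defined by $\Gamma_0(M') \subseteq \Gamma(M'|\sigma)$ and, for all $\gamma', \gamma'' \in \Gamma(M')$, $\hat{i} \in \hat{I}(\Sigma)$,
and $\hat{O} \in \hat{O}(\Sigma)^*$, if $\gamma' \in \Gamma(M'|\sigma)$ and $(\gamma', \gamma'') \in R(M')_{\hat{I}(\sigma)(\hat{i}),\hat{O}(\sigma)(\hat{O})}$ then
$\gamma'' \in \Gamma(M'|\sigma)$ and $(\gamma', \gamma'') \in R(M'|\sigma)_{\hat{i},\hat{O}}$;

- $\Gamma_0(M'|\sigma) = \Gamma_0(M')$; and

- $\omega(M'|\sigma)(\gamma') = (\omega(M')(\gamma'))|A(\sigma)$ for all $\gamma' \in \Gamma(M'|\sigma)$.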


This $\sigma$-reduct keeps exactly those transitions that are a direct image along $\sigma$. It would
also be possible to additionally keep transitions that show a super-list of the outputs
that can be reached by $\sigma$. When moving to $\mathcal{M}^{\downarrow}_{\mathcal{D}}$-sentences, however, it turns out to be
impossible to fix a particular list of outputs.

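Given sets of input events $J \subseteq I(\Sigma)$ and output events $N \subseteq O(\Sigma)$, we denote by
$\Gamma^{J,N}(M, \gamma)$ and $\Gamma^{J,N}(M)$, respectively, the set of configurations of a $\Sigma$-event/data
structure $M$ that are $J,N$-reachable from a configuration $\gamma \in \Gamma(M)$ and from an initial
configuration $\gamma_0 \in \Gamma_0(M)$, respectively. Here a $\gamma_n \in \Gamma(M)$ is $J,N$-reachable in $M$
from a $\gamma_1 \in \Gamma(M)$ if there are $n \geq 1$, $\hat{i}_2, \ldots, \hat{i}_n \in \hat{I}(J)$, $\hat{O}_2, \ldots, \hat{O}_n \in \hat{O}(N)^*$, and
$(\gamma_k, \gamma_{k+1}) \in R(M)_{\hat{i}_{k+1},\hat{O}_{k+1}}$ for all $1 \leq k < n$.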
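
The notion of J,N-reachability amounts to a graph search that only follows transitions whose input event lies in J and whose output events all lie in N. The sketch below illustrates this under simplifying assumptions: configurations are plain strings rather than pairs of control and data states, a structure is given as a finite list of transitions, and the names `reachable` and `ATM_TRANSITIONS` as well as the example transitions are hypothetical.

```python
from collections import deque


def reachable(transitions, start, J, N):
    """Configurations reachable from `start` via J,N-restricted transitions.

    transitions -- iterable of (source, input_message, outputs, target),
                   where a message is a pair (event, arguments) and
                   outputs is a tuple of messages
    J, N        -- sets of allowed input and output event names
    """
    seen = {start}  # zero transitions: start is reachable from itself
    queue = deque([start])
    while queue:
        gamma = queue.popleft()
        for source, (in_event, _args), outputs, target in transitions:
            if source != gamma or in_event not in J:
                continue
            if any(out_event not in N for out_event, _ in outputs):
                continue
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Hypothetical fragment loosely inspired by the ATM example.
ATM_TRANSITIONS = [
    ("Idle", ("card", (17,)), (), "CardEntered"),
    ("CardEntered", ("PIN", (1234,)), (("verify", (17, 1234)),), "Verifying"),
    ("Verifying", ("verified", ()), (("ejectCard", ()),), "Idle"),
]

print(sorted(reachable(ATM_TRANSITIONS, "Idle", {"card", "PIN"}, {"verify"})))
```

Shrinking J or N can only shrink the result, which is what makes these sets usable as relativisations of the jump and globally operators.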


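The $\Sigma$-event/data structures form the discrete category $\mathrm{Str}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}(\Sigma)$ of $\mathcal{M}^{\downarrow}_{\mathcal{D}}$-structures
over $\Sigma$. For each $\sigma : \Sigma \to \Sigma'$ in $\mathbb{S}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}$ the $\sigma$-reduct functor $\mathrm{Str}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}(\sigma) : \mathrm{Str}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}(\Sigma') \to
\mathrm{Str}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}(\Sigma)$ is given by $\mathrm{Str}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}(\sigma)(M') = M'|\sigma$.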


3.5 Event/Data Formulae and Sentences


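The $\Sigma$-event/data formulae $\mathscr{F}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\Sigma,S}$ over an event/data signature $\Sigma$ and a set of state
variables $S$ are inductively defined by


- $\varphi$: data state sentence $\varphi \in \mathscr{F}^{\mathcal{D}}_{A(\Sigma),\emptyset}$ holds in the current configuration;

- $s$: the control state of the current configuration is $s \in S$;

- $\downarrow s . \varrho$: calling the current control state $s$, formula $\varrho \in \mathscr{F}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\Sigma,S \uplus \{s\}}$ holds ($s$ is turned
into a fresh variable by adding it by disjoint union to the set of state variables);

- $(@^{J,N} s)\varrho$: in all configurations with control state $s \in S$ that are $J,N$-reachable,
formula $\varrho \in \mathscr{F}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\Sigma,S}$ holds (relativised “jump”);

- $\Box^{J,N} \varrho$: in all configurations that are $J,N$-reachable from the current configuration
formula $\varrho \in \mathscr{F}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\Sigma,S}$ holds (relativised “globally”);

- $\langle i /\!\!/ [O]_N : \psi \rangle \varrho$: in the current configuration there are valuations $\beta : X(i) \to \mathcal{D}$,
$\beta' : X(O) \to \mathcal{D}$, and a transition for the incoming message $i(\beta) \in \hat{I}(\Sigma)$ and the
outgoing messages $\hat{O}' \in O(\beta') \parallel \hat{N}$ for $O(\beta') \in \hat{O}(\Sigma)^*$, $\hat{N} \in \hat{O}(N)^*$ such that $\beta \cup \beta'$
satisfies transition formula $\psi \in \mathscr{F}^{2\mathcal{D}}_{A(\Sigma),X(i) \cup X(O)}$ and $\varrho \in \mathscr{F}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\Sigma,S}$ holds afterwards;

- $\langle i : \phi /\!\!/ [O]_N : \psi \rangle\!\rangle \varrho$: in the current configuration for all valuations $\beta : X(i) \to \mathcal{D}$
satisfying state formula $\phi \in \mathscr{F}^{\mathcal{D}}_{A(\Sigma),X(i)}$ there are a valuation $\beta' : X(O) \to \mathcal{D}$ and
a transition for the incoming message $i(\beta) \in \hat{I}(\Sigma)$ and the outgoing messages
$\hat{O}' \in O(\beta') \parallel \hat{N}$ for $O(\beta') \in \hat{O}(\Sigma)^*$, $\hat{N} \in \hat{O}(N)^*$ such that $\beta \cup \beta'$ satisfies transition
formula $\psi \in \mathscr{F}^{2\mathcal{D}}_{A(\Sigma),X(i) \cup X(O)}$ and $\varrho \in \mathscr{F}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\Sigma,S}$ holds afterwards;

- $\neg \varrho$: in the current configuration $\varrho \in \mathscr{F}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\Sigma,S}$ does not hold;

- $\varrho_1 \vee \varrho_2$: in the current configuration $\varrho_1 \in \mathscr{F}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\Sigma,S}$ or $\varrho_2 \in \mathscr{F}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\Sigma,S}$ holds.


We write $(@s)\varrho$ for $(@^{I(\Sigma),O(\Sigma)} s)\varrho$, $\Box \varrho$ for $\Box^{I(\Sigma),O(\Sigma)} \varrho$, $[i /\!\!/ [O]_N : \psi]\varrho$ for
$\neg\langle i /\!\!/ [O]_N : \psi \rangle \neg\varrho$, and true for $\downarrow s . s$; we write $O$ for $[O]_{\emptyset}$.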

Two different kinds of relativisations are used in $\mathcal{M}^{\downarrow}_{\mathcal{D}}$-formulæ: For the jump operator
$(@^{J,N} s)\varrho$ and the globally operator $\Box^{J,N}\varrho$ the subsets of input events $J \subseteq I(\Sigma)$ and
output events $N \subseteq O(\Sigma)$ restrict the referable states in an $\mathcal{M}^{\downarrow}_{\mathcal{D}}$-structure to those that
are $J,N$-reachable. On the other hand, $[O]_N$ specifies that besides messages from $O$
additional messages for events in $N \subseteq O(\Sigma)$ can be mixed into the output, such that, in
particular, $[O]_\emptyset$ requires exactly $O$. Since the set of output events is assumed to be finite,
$[O]_N$ can be used to specify message lists of arbitrary length with finitely many formulæ.
Moreover, the syntactic information in both kinds of relativisations is kept through a
translation to another $\mathcal{M}^{\downarrow}_{\mathcal{D}}$-signature.

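Let $\sigma\colon \Sigma \to \Sigma'$ be an event/data signature morphism. The event/data formulæ
translation $\mathscr{F}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\sigma,S}\colon \mathscr{F}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\Sigma,S} \to \mathscr{F}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\Sigma',S}$ along $\sigma$ is recursively given by

- $\mathscr{F}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\sigma,S}(\varphi) = \mathscr{F}^{\mathcal{D}}_{A(\sigma),\emptyset}(\varphi)$;
- $\mathscr{F}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\sigma,S}(s) = s$;
- $\mathscr{F}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\sigma,S}(\downarrow s\,.\,\varrho) = \downarrow s\,.\,\mathscr{F}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\sigma,S \uplus \{s\}}(\varrho)$;
- $\mathscr{F}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\sigma,S}((@^{J,N} s)\varrho) = (@^{I(\sigma)(J),O(\sigma)(N)} s)\mathscr{F}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\sigma,S}(\varrho)$;
- $\mathscr{F}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\sigma,S}(\Box^{J,N}\varrho) = \Box^{I(\sigma)(J),O(\sigma)(N)}\mathscr{F}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\sigma,S}(\varrho)$;
- $\mathscr{F}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\sigma,S}(\langle i /\!\!/ [O]_N : \psi\rangle\varrho) =$
  $\langle I(\sigma)(i) /\!\!/ [O(\sigma)(O)]_{O(\sigma)(N)} : \mathscr{F}^{2\mathcal{D}}_{A(\sigma),X(i) \cup X(O)}(\psi)\rangle\mathscr{F}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\sigma,S}(\varrho)$;
- $\mathscr{F}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\sigma,S}(\langle i : \varphi /\!\!/ [O]_N : \psi\rangle\!\rangle\varrho) =$
  $\langle I(\sigma)(i) : \mathscr{F}^{\mathcal{D}}_{A(\sigma),X(i)}(\varphi) /\!\!/ [O(\sigma)(O)]_{O(\sigma)(N)} : \mathscr{F}^{2\mathcal{D}}_{A(\sigma),X(i) \cup X(O)}(\psi)\rangle\!\rangle\mathscr{F}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\sigma,S}(\varrho)$;
- $\mathscr{F}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\sigma,S}(\neg\varrho) = \neg\mathscr{F}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\sigma,S}(\varrho)$;
- $\mathscr{F}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\sigma,S}(\varrho_1 \lor \varrho_2) = \mathscr{F}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\sigma,S}(\varrho_1) \lor \mathscr{F}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\sigma,S}(\varrho_2)$.

The set $\mathrm{Sen}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}(\Sigma)$ of $\Sigma$-event/data sentences is given by $\mathscr{F}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\Sigma,\emptyset}$, the event/data
sentence translation $\mathrm{Sen}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}(\sigma)\colon \mathrm{Sen}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}(\Sigma) \to \mathrm{Sen}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}(\Sigma')$ by $\mathscr{F}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\sigma,\emptyset}$.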


3.6 Satisfaction Relation for $\mathcal{M}^{\downarrow}_{\mathcal{D}}$


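Let $\Sigma$ be an event/data signature, $M$ a $\Sigma$-event/data structure, $S$ a set of state variables,
$v\colon S \to C(M)$ a state variable assignment, and $\gamma \in \Gamma(M)$. The satisfaction relation for
event/data formulæ is inductively given by

- $M, v, \gamma \models^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\Sigma,S} \varphi$ iff $\omega(M)(\gamma), \emptyset \models^{\mathcal{D}}_{A(\Sigma),\emptyset} \varphi$;
- $M, v, \gamma \models^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\Sigma,S} s$ iff $v(s) = c(M)(\gamma)$;
- $M, v, \gamma \models^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\Sigma,S} \downarrow s\,.\,\varrho$ iff $M, v\{s \mapsto c(M)(\gamma)\}, \gamma \models^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\Sigma,S \uplus \{s\}} \varrho$;
- $M, v, \gamma \models^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\Sigma,S} (@^{J,N} s)\varrho$ iff $M, v, \gamma' \models^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\Sigma,S} \varrho$ for all $\gamma' \in \Gamma^{J,N}(M)$ with
  $c(M)(\gamma') = v(s)$;
- $M, v, \gamma \models^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\Sigma,S} \Box^{J,N}\varrho$ iff $M, v, \gamma' \models^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\Sigma,S} \varrho$ for all $\gamma' \in \Gamma^{J,N}(M, \gamma)$;
- $M, v, \gamma \models^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\Sigma,S} \langle i /\!\!/ [O]_N : \psi\rangle\varrho$ iff there are valuations $\beta\colon X(i) \to \mathcal{D}$, $\beta'\colon X(O) \to \mathcal{D}$,
  output messages $\hat{O}' \in O(\beta') \parallel \hat{N}$ with $\hat{N} \in \hat{O}(N)^*$, and a configuration $\gamma' \in \Gamma(M)$
  such that $(\gamma, \gamma') \in R(M)_{i(\beta),\hat{O}'}$, $(\omega(M)(\gamma), \omega(M)(\gamma')), \beta \cup \beta' \models^{2\mathcal{D}}_{A(\Sigma),X(i) \cup X(O)} \psi$,
  and $M, v, \gamma' \models^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\Sigma,S} \varrho$;
- $M, v, \gamma \models^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\Sigma,S} \langle i : \varphi /\!\!/ [O]_N : \psi\rangle\!\rangle\varrho$ iff for all valuations $\beta\colon X(i) \to \mathcal{D}$ that satisfy
  $\omega(M)(\gamma), \beta \models^{\mathcal{D}}_{A(\Sigma),X(i)} \varphi$ there are a valuation $\beta'\colon X(O) \to \mathcal{D}$, output messages
  $\hat{O}' \in O(\beta') \parallel \hat{N}$ with $\hat{N} \in \hat{O}(N)^*$, and a configuration $\gamma' \in \Gamma(M)$ such that
  $(\gamma, \gamma') \in R(M)_{i(\beta),\hat{O}'}$, $(\omega(M)(\gamma), \omega(M)(\gamma')), \beta \cup \beta' \models^{2\mathcal{D}}_{A(\Sigma),X(i) \cup X(O)} \psi$, and
  $M, v, \gamma' \models^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\Sigma,S} \varrho$;
- $M, v, \gamma \models^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\Sigma,S} \neg\varrho$ iff $M, v, \gamma \not\models^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\Sigma,S} \varrho$;
- $M, v, \gamma \models^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\Sigma,S} \varrho_1 \lor \varrho_2$ iff $M, v, \gamma \models^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\Sigma,S} \varrho_1$ or $M, v, \gamma \models^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\Sigma,S} \varrho_2$.

For a $\Sigma \in |\mathbb{S}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}|$, an $M \in |\mathrm{Str}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}(\Sigma)|$, and a $\rho \in \mathrm{Sen}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}(\Sigma)$ the satisfaction
relation $M \models^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\Sigma} \rho$ holds if, and only if, $M, \emptyset, \gamma_0 \models^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\Sigma,\emptyset} \rho$ for all $\gamma_0 \in \Gamma_0(M)$.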


Theorem 1. $(\mathbb{S}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}, \mathrm{Str}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}, \mathrm{Sen}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}, \models^{\mathcal{M}^{\downarrow}_{\mathcal{D}}})$ is an institution.

from Basic/StructuredDatatypes get List, Set   % import finite lists and sets
spec Trans_Σ = Dt
then free type InEvt ::= I(Σ)
     free type OutEvt ::= O(Σ)
then List[sort OutEvt] and Set[sort InEvt] and Set[sort OutEvt]
then sort Ctrl
     free type Conf ::= conf(c : Ctrl; A(Σ))
     preds init : Conf;
           trans : Conf × InEvt × List[OutEvt] × Conf
     • ∃ g : Conf • init(g)   % there is some initial configuration
then free { pred reachable : Set[InEvt] × Set[OutEvt] × Conf × Conf
            ∀ g, g', g'' : Conf; J : Set[InEvt]; N : Set[OutEvt]; i : InEvt; O : List[OutEvt]
            • reachable(J, N, g, g)
            • reachable(J, N, g, g') ∧ i ∈ J ∧ O ⊆ N ∧ trans(g', i, O, g'') ⇒
              reachable(J, N, g, g'') }
then preds reachable(J : Set[InEvt]; N : Set[OutEvt]; g : Conf) ⇔
             ∃ g0 : Conf • init(g0) ∧ reachable(J, N, g0, g);
           reachable(g : Conf) ⇔ reachable(I(Σ), O(Σ), g)
then pred mixed : List[OutEvt] × Set[OutEvt] × List[OutEvt]
     ∀ o, o' : OutEvt; O, O' : List[OutEvt]; N : Set[OutEvt]
     • mixed(O, N, O)
     • mixed(o :: O, N, o :: O') if mixed(O, N, O')
     • mixed(O, N, o' :: O') if mixed(O, N, O') ∧ o' ∈ N
end

Figure 2. Frame for translating $\mathcal{M}^{\downarrow}_{\mathcal{D}}$ into CASL.
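The intended reading of the predicate mixed in Fig. 2 can be made concrete by an executable sketch. The following Python function is an illustration only, assuming output events are plain strings and lists are Python sequences; it decides whether a list is an interleaving of a given output list with additional events drawn from a set, following the three axioms of the figure.

```python
def mixed(O, N, O2):
    """True if O2 is O interleaved with zero or more extra events from N."""
    if not O2:
        # Only the empty list is a shuffle that yields the empty list.
        return not O
    head, rest = O2[0], O2[1:]
    # Second axiom: the next element of O2 is the next required output of O.
    if O and O[0] == head and mixed(O[1:], N, rest):
        return True
    # Third axiom: the next element of O2 is an additional event from N.
    return head in N and mixed(O, N, rest)


# Hypothetical event names, for illustration only.
ok = mixed(["a", "b"], {"x"}, ["x", "a", "x", "b"])
reordered = mixed(["a", "b"], {"x"}, ["b", "a"])
```

With an empty set of additional events the check degenerates to list equality, which mirrors the convention that $[O]_\emptyset$ requires exactly $O$.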


3.7 A Theoroidal Comorphism from $\mathcal{M}^{\downarrow}_{\mathcal{D}}$ to CASL


We define a theoroidal comorphism from $\mathcal{M}^{\downarrow}_{\mathcal{D}}$ to CASL. The construction mainly follows
the standard translation of modal logics to first-order logic [1] and extends the scheme
of [16] by outputs.

The basis is a representation of $\mathcal{M}^{\downarrow}_{\mathcal{D}}$-signatures and the frame given by $\mathcal{M}^{\downarrow}_{\mathcal{D}}$-structures
as a CASL-specification as shown in Fig. 2. The signature translation

$\nu^{\mathrm{Sig}}\colon \mathbb{S}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}} \to \mathbb{P}\mathrm{res}^{\mathrm{CASL}}$

maps an $\mathcal{M}^{\downarrow}_{\mathcal{D}}$-signature $\Sigma$ to the CASL-theory presentation given by $\mathrm{Trans}_\Sigma$ and an
$\mathcal{M}^{\downarrow}_{\mathcal{D}}$-signature morphism to the corresponding theory presentation morphism. $\mathrm{Trans}_\Sigma$
first of all covers the events according to $I(\Sigma)$ and $O(\Sigma)$ with types InEvt and OutEvt,
and the configurations with type Conf showing a single constructor conf for the control
state from Ctrl and a data state given by assignments to the attributes from $A(\Sigma)$.
Furthermore, $\mathrm{Trans}_\Sigma$ sets the frame for describing reachable transition systems with a
set of initial configurations, a transition relation, and reachability predicates, where the
specification of reachable uses CASL’s “structured free” construct to ensure reachability
to be inductively defined. Finally, a predicate mixed is included for representing the
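shufflings of a list of outputs with some additional output events.

The model translation

$\nu^{\mathrm{Mod}}_\Sigma\colon \mathrm{Mod}^{\mathrm{CASL}}(\nu^{\mathrm{Sig}}(\Sigma)) \to \mathrm{Str}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}(\Sigma)$

then can rely on this encoding. In particular, for a model $M' \in \mathrm{Mod}^{\mathrm{CASL}}(\nu^{\mathrm{Sig}}(\Sigma))$,
there are bijective maps $\iota_{M',\mathrm{Conf}}\colon \mathrm{Conf}^{M'} \cong \mathrm{Ctrl}^{M'} \times \Omega(A(\Sigma))$ for the configurations
as well as $\iota_{M',\mathrm{InEvt}}\colon \mathrm{InEvt}^{M'} \cong \hat{I}(\Sigma)$ and $\iota_{M',\mathrm{OutEvt}}\colon \mathrm{OutEvt}^{M'} \cong \hat{O}(\Sigma)$ for the
messages. Moreover, $\mathrm{mixed}^{M'}(\iota^{-1}_{M',\mathrm{OutEvt}}(\hat{O}), \iota^{-1}_{M',\mathrm{OutEvt}}(N'), \iota^{-1}_{M',\mathrm{OutEvt}}(\hat{O}'))$ if, and
only if, $\hat{O}' \in \hat{O} \parallel \hat{N}'$ with $\hat{N}' \in N'^*$. The $\mathcal{M}^{\downarrow}_{\mathcal{D}}$-structure resulting from a CASL-model
$M'$ of $\mathrm{Trans}_\Sigma$ can thus be defined by

- $\Gamma(\nu^{\mathrm{Mod}}_\Sigma(M')) = \iota_{M',\mathrm{Conf}}(\{g^{M'} \in \mathrm{Conf}^{M'} \mid \mathrm{reachable}^{M'}(g^{M'})\})$
- $R(\nu^{\mathrm{Mod}}_\Sigma(M'))_{\hat{\imath},\hat{O}} = \{(\gamma, \gamma') \in \Gamma(\nu^{\mathrm{Mod}}_\Sigma(M')) \times \Gamma(\nu^{\mathrm{Mod}}_\Sigma(M')) \mid$
  $\mathrm{trans}^{M'}(\iota^{-1}_{M',\mathrm{Conf}}(\gamma), \iota^{-1}_{M',\mathrm{InEvt}}(\hat{\imath}), \iota^{-1}_{M',\mathrm{OutEvt}}(\hat{O}), \iota^{-1}_{M',\mathrm{Conf}}(\gamma'))\}$
- $\Gamma_0(\nu^{\mathrm{Mod}}_\Sigma(M')) = \{\gamma \in \Gamma(\nu^{\mathrm{Mod}}_\Sigma(M')) \mid \mathrm{init}^{M'}(\iota^{-1}_{M',\mathrm{Conf}}(\gamma))\}$
- $\omega(\nu^{\mathrm{Mod}}_\Sigma(M')) = \{(c, \omega) \in \Gamma(\nu^{\mathrm{Mod}}_\Sigma(M')) \mapsto \omega\}$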


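For $\mathcal{M}^{\downarrow}_{\mathcal{D}}$-sentences, we first define a formula translation

$\nu^{\mathscr{F}}_{\Sigma,S,g}\colon \mathscr{F}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\Sigma,S} \to \mathscr{F}^{\mathrm{CASL}}_{\nu^{\mathrm{Sig}}(\Sigma),S \cup \{g\}}$

which, mimicking the standard translation, takes a variable $g$ : Conf as a parameter
that records the “current configuration” and also uses a set $S$ of state names for the
control states. The translation embeds the data state and 2-data state formulæ using the
substitution $A(\Sigma)(g) = \{a \mapsto a(g) \mid a \in A(\Sigma)\}$ for replacing the attributes $a \in A(\Sigma)$
by the accessors $a(g)$. The translation of $\mathcal{M}^{\downarrow}_{\mathcal{D}}$-formulæ then reads

- $\nu^{\mathscr{F}}_{\Sigma,S,g}(\varphi) = \mathscr{F}^{\mathcal{D}}_{\Sigma,A(\Sigma)(g)}(\varphi)$
- $\nu^{\mathscr{F}}_{\Sigma,S,g}(s) = (s = c(g))$
- $\nu^{\mathscr{F}}_{\Sigma,S,g}(\downarrow s\,.\,\varrho) = \exists s : \mathrm{Ctrl}\,.\,s = c(g) \land \nu^{\mathscr{F}}_{\Sigma,S \uplus \{s\},g}(\varrho)$
- $\nu^{\mathscr{F}}_{\Sigma,S,g}((@^{J,N} s)\varrho) = \forall g' : \mathrm{Conf}\,.\,(c(g') = s \land \mathrm{reachable}(J, N, g')) \Rightarrow \nu^{\mathscr{F}}_{\Sigma,S,g'}(\varrho)$
- $\nu^{\mathscr{F}}_{\Sigma,S,g}(\Box^{J,N}\varrho) = \forall g' : \mathrm{Conf}\,.\,\mathrm{reachable}(J, N, g, g') \Rightarrow \nu^{\mathscr{F}}_{\Sigma,S,g'}(\varrho)$
- $\nu^{\mathscr{F}}_{\Sigma,S,g}(\langle i /\!\!/ [O]_N : \psi\rangle\varrho) = \exists X : s(I(\Sigma))(i);\ X' : s(O(\Sigma))(O);$
  $O' : \mathrm{List[OutEvt]};\ g' : \mathrm{Conf}\,.$
  $\mathrm{mixed}(O(X'), N, O') \land \mathrm{trans}(g, i(X), O', g') \land$
  $\mathscr{F}^{2\mathcal{D}}_{\Sigma,A(\Sigma)(g) \cup A(\Sigma)(g') \cup X \cup X'}(\psi) \land \nu^{\mathscr{F}}_{\Sigma,S,g'}(\varrho)$
- $\nu^{\mathscr{F}}_{\Sigma,S,g}(\langle i : \varphi /\!\!/ [O]_N : \psi\rangle\!\rangle\varrho) = \forall X : s(I(\Sigma))(i)\,.\,\mathscr{F}^{\mathcal{D}}_{\Sigma,A(\Sigma)(g) \cup X}(\varphi) \Rightarrow$
  $\exists X' : s(O(\Sigma))(O);\ O' : \mathrm{List[OutEvt]};\ g' : \mathrm{Conf}\,.$
  $\mathrm{mixed}(O(X'), N, O') \land \mathrm{trans}(g, i(X), O', g') \land$
  $\mathscr{F}^{2\mathcal{D}}_{\Sigma,A(\Sigma)(g) \cup A(\Sigma)(g') \cup X \cup X'}(\psi) \land \nu^{\mathscr{F}}_{\Sigma,S,g'}(\varrho)$
- $\nu^{\mathscr{F}}_{\Sigma,S,g}(\neg\varrho) = \neg\nu^{\mathscr{F}}_{\Sigma,S,g}(\varrho)$
- $\nu^{\mathscr{F}}_{\Sigma,S,g}(\varrho_1 \lor \varrho_2) = \nu^{\mathscr{F}}_{\Sigma,S,g}(\varrho_1) \lor \nu^{\mathscr{F}}_{\Sigma,S,g}(\varrho_2)$

Building on the translation of formulæ, the sentence translation

$\nu^{\mathrm{Sen}}_\Sigma\colon \mathrm{Sen}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}(\Sigma) \to \mathrm{Sen}^{\mathrm{CASL}}(\nu^{\mathrm{Sig}}(\Sigma))$

only has to require additionally that evaluation starts in an initial state:

- $\nu^{\mathrm{Sen}}_\Sigma(\rho) = \forall g : \mathrm{Conf}\,.\,\mathrm{init}(g) \Rightarrow \nu^{\mathscr{F}}_{\Sigma,\emptyset,g}(\rho)$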


Theorem 2. $(\nu^{\mathrm{Sig}}, \nu^{\mathrm{Mod}}, \nu^{\mathrm{Sen}})$ is a theoroidal comorphism from $\mathcal{M}^{\downarrow}_{\mathcal{D}}$ to CASL.
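To illustrate how the formula translation threads the “current configuration” variable through a formula, the following sketch renders the purely modal fragment (state variables, binder, jump, globally, negation, disjunction) as first-order text. It is not the comorphism itself: the tuple encoding of formulæ, the function names `translate` and `sentence`, and the ASCII connectives are assumptions made for this example, and data formulæ and the transition modalities are omitted.

```python
import itertools


def translate(phi, g="g", fresh=None):
    """Render a formula of the modal fragment as first-order text over configuration g."""
    fresh = fresh if fresh is not None else itertools.count(1)
    tag = phi[0]
    if tag == "state":                      # s
        return f"{phi[1]} = c({g})"
    if tag == "bind":                       # binder: name the current control state
        _, s, body = phi
        return f"(exists {s} : Ctrl . {s} = c({g}) and {translate(body, g, fresh)})"
    if tag == "at":                         # relativised jump
        _, J, N, s, body = phi
        h = f"g{next(fresh)}"               # fresh configuration variable
        inner = translate(body, h, fresh)
        return f"(forall {h} : Conf . (c({h}) = {s} and reachable({J}, {N}, {h})) => {inner})"
    if tag == "box":                        # relativised globally
        _, J, N, body = phi
        h = f"g{next(fresh)}"
        inner = translate(body, h, fresh)
        return f"(forall {h} : Conf . reachable({J}, {N}, {g}, {h}) => {inner})"
    if tag == "not":
        return f"not ({translate(phi[1], g, fresh)})"
    if tag == "or":
        return f"({translate(phi[1], g, fresh)} or {translate(phi[2], g, fresh)})"
    raise ValueError(f"unsupported formula: {phi!r}")


def sentence(phi):
    """Sentences are evaluated in all initial configurations."""
    return f"forall g : Conf . init(g) => {translate(phi, 'g')}"


example = sentence(("box", "J", "N", ("state", "s")))
```

Each modal operator introduces a fresh configuration variable, and the body is translated relative to that variable; this is the only place where the parameter $g$ changes.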


For a CASL-proof of an $\mathcal{M}^{\downarrow}_{\mathcal{D}}$-invariant $\varphi$ such that $\varphi$ has to hold in every reachable
configuration, the full generality of the reachable predicate can sometimes be avoided
by replacing the proof obligation $\forall g : \mathrm{Conf}\,.\,\mathrm{reachable}(g) \Rightarrow \mathscr{F}^{\mathcal{D}}_{\Sigma,A(\Sigma)(g)}(\varphi)$ by
the usual stepwise induction scheme that only requires to demonstrate the invariant to
hold in all initial configurations and that it is preserved by every transition. Moreover,
the $\mathcal{M}^{\downarrow}_{\mathcal{D}}$-state formula $\varphi$ can be generalised into a CASL-invariant.


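Proposition 1. Let $(\Sigma, P)$ be a theory presentation in $\mathcal{M}^{\downarrow}_{\mathcal{D}}$ and $(\nu^{\mathrm{Sig}}(\Sigma), P')$ a theory
presentation in CASL such that $\mathrm{Mod}^{\mathrm{CASL}}(\nu^{\mathrm{Pres}}(\Sigma, P)) \subseteq \mathrm{Mod}^{\mathrm{CASL}}(\nu^{\mathrm{Sig}}(\Sigma), P')$. Let
$\mathit{inv}^{\mathrm{CASL}}(g) \in \mathscr{F}^{\mathrm{CASL}}_{\nu^{\mathrm{Sig}}(\Sigma),\{g\}}$ be a CASL-formula with a single free variable $g$ and
$\mathit{inv}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}} \in \mathscr{F}^{\mathcal{D}}_{A(\Sigma),\emptyset}$ an $\mathcal{M}^{\downarrow}_{\mathcal{D}}$-state formula, such that

(I0) $\forall g : \mathrm{Conf}\,.\,\mathit{inv}^{\mathrm{CASL}}(g) \Rightarrow \mathscr{F}^{\mathcal{D}}_{\Sigma,A(\Sigma)(g)}(\mathit{inv}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}})$

(I1) $\forall g : \mathrm{Conf}\,.\,\mathrm{init}(g) \Rightarrow \mathit{inv}^{\mathrm{CASL}}(g)$

(I2) $\forall g, g' : \mathrm{Conf};\ i : \mathrm{InEvt};\ O : \mathrm{List[OutEvt]}\,.\,\mathit{inv}^{\mathrm{CASL}}(g) \land \mathrm{trans}(g, i, O, g') \Rightarrow \mathit{inv}^{\mathrm{CASL}}(g')$

hold in every model $M' \in \mathrm{Mod}^{\mathrm{CASL}}(\nu^{\mathrm{Sig}}(\Sigma), P')$. Then $\nu^{\mathrm{Mod}}(M') \models^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\Sigma} \Box\,\mathit{inv}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}$ for
all models $M' \in \mathrm{Mod}^{\mathrm{CASL}}(\nu^{\mathrm{Pres}}(\Sigma, P))$.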
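The stepwise induction scheme behind conditions (I1) and (I2) can be tried out on a finite transition system. The sketch below is an illustration under simplifying assumptions: configurations are plain integers, a transition is a tuple (source, input, outputs, target), and the names `is_inductive` and `reachable_confs` are hypothetical. It also shows why a stronger invariant may be needed: a property can hold in every reachable configuration without being preserved by transitions from unreachable ones.

```python
def is_inductive(init, trans, inv):
    """Check (I1) inv holds initially and (I2) inv is preserved by every transition."""
    base = all(inv(g) for g in init)
    step = all(inv(g2) for g1, _i, _outs, g2 in trans if inv(g1))
    return base and step


def reachable_confs(init, trans):
    """All configurations reachable from the initial ones (for comparison only)."""
    seen, todo = set(init), list(init)
    while todo:
        g = todo.pop()
        for g1, _i, _outs, g2 in trans:
            if g1 == g and g2 not in seen:
                seen.add(g2)
                todo.append(g2)
    return seen


# Toy system: a counter modulo 6 that advances by 2 on every "tick".
init = {0}
trans = [(n, "tick", (), (n + 2) % 6) for n in range(6)]


def even(n):
    return n % 2 == 0


def not_three(n):
    return n != 3


strong_ok = is_inductive(init, trans, even)       # inductive, hence an invariant
weak_ok = is_inductive(init, trans, not_three)    # true when reachable, but not inductive
```

Here `not_three` holds in all reachable configurations, yet (I2) fails for the unreachable configuration 1; the stronger property `even` is inductive and implies `not_three`, which is the role played by (I0).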


4 Simple UML State Machines with Outputs 


UML state machines [13, Ch. 14] provide means to specify the reactive behaviour of 
objects or component instances. These entities hold an internal data state, typically given 
by a set of attributes or properties as specified in a static structure, and shall react to event 
occurrences like incoming messages by firing different transitions in different control 
states. Such transitions may have a guard depending on event arguments and the internal 
state and may change, as an effect, the internal control and data state of the entity as 
well as send out messages on their own. Beyond such “simple” means for specifying 
reactive entities, UML state machines offer also more advanced modelling constructs, 
like hierarchical states or compound transitions, which, however, we defer to future work. 
In our formal account, extending again [16], a simple UML state machine with
outputs $U$ uses an event/data signature $\Sigma(U)$ for its input and output events as well as
its attributes and consists of a finite set of control states $C(U)$; a finite set of transition
specifications $T(U)$ of the form $(c, \varphi, i(X), o_1(X_1), \ldots, o_m(X_m), \psi, c')$ with
- source and target control states $c, c' \in C(U)$,
- input event $i(X) \in I(\Sigma(U))$ and output events $o_1(X_1), \ldots, o_m(X_m) \in O(\Sigma(U))$
  such that $X \cap \bigcup_{1 \leq p \leq m} X_p = \emptyset$,
- precondition state predicate $\varphi \in \mathscr{F}^{\mathcal{D}}_{A(\Sigma(U)),X}$, and
- postcondition transition predicate $\psi \in \mathscr{F}^{2\mathcal{D}}_{A(\Sigma(U)),X \cup \bigcup_{1 \leq p \leq m} X_p}$;

an initial control state $c_0(U) \in C(U)$; and an initial state predicate $\varphi_0(U) \in \mathscr{F}^{\mathcal{D}}_{A(\Sigma(U)),\emptyset}$
such that $C(U)$ is syntactically reachable, i.e., for every $c \in C(U) \setminus \{c_0(U)\}$ there are
$(c_0(U), \varphi_1, i_1, O_1, \psi_1, c_1), \ldots, (c_{n-1}, \varphi_n, i_n, O_n, \psi_n, c_n) \in T(U)$ with $n > 0$ and $c_n = c$.
The constraint of syntactic reachability is only introduced to simplify semantic and
algorithmic constructions on simple UML state machines with output.
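Syntactic reachability only inspects the source and target control states of the transition specifications, so it can be checked by a plain graph search. The following sketch assumes transition specifications are given as 6-tuples (source, precondition, input, outputs, postcondition, target) whose predicates are ignored; the function name and the example states are hypothetical.

```python
def syntactically_reachable(states, transitions, c0):
    """True if every control state can be reached from c0 via transition specifications."""
    successors = {}
    for c, _phi, _i, _outs, _psi, c2 in transitions:
        successors.setdefault(c, set()).add(c2)
    seen, todo = {c0}, [c0]
    while todo:
        c = todo.pop()
        for c2 in successors.get(c, ()):
            if c2 not in seen:
                seen.add(c2)
                todo.append(c2)
    return set(states) <= seen


# Hypothetical specifications; guards and postconditions are left abstract (None).
specs = [
    ("Idle", None, "card", (), None, "CardEntered"),
    ("CardEntered", None, "PIN", ("verify",), None, "PINEntered"),
]

all_reachable = syntactically_reachable({"Idle", "CardEntered", "PINEntered"}, specs, "Idle")
```

Since guards are ignored, the check is purely syntactic: a state may pass it and still be semantically unreachable if some precondition can never be satisfied.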

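A $\Sigma(U)$-event/data structure $M$ is a model of a simple UML state machine $U$ with
output, $M \in \mathrm{Mod}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}(U)$, if $C(U) \subseteq C(M)$ up to a bijective renaming, $C_0(M) =
\{c_0(U)\}$, $\Omega_0(M) \subseteq \{\omega \in |\Omega(A(\Sigma(U)))| \mid \omega \models^{\mathcal{D}}_{A(\Sigma(U)),\emptyset} \varphi_0(U)\}$, and if the following
holds for all $(c, d) \in \Gamma(M)$:

- for all transition specifications $(c, \varphi, i, O, \psi, c') \in T(U)$ and $\beta\colon X(i) \to \mathcal{D}$ with
  $\omega(M)(d), \beta \models^{\mathcal{D}}_{A(\Sigma(U)),X(i)} \varphi$, there is a $\beta'\colon X(O) \to \mathcal{D}$ and a pair $((c, d), (c', d')) \in
  R(M)_{i(\beta),O(\beta')}$ such that $(\omega(M)(d), \omega(M)(d')), \beta \cup \beta' \models^{2\mathcal{D}}_{A(\Sigma(U)),X(i) \cup X(O)} \psi$;
- for all pairs $((c, d), (c', d')) \in R(M)_{i(\beta),O(\beta')}$ there is some transition specification
  $(c, \varphi, i, O, \psi, c') \in T(U)$ such that $\omega(M)(d), \beta \models^{\mathcal{D}}_{A(\Sigma(U)),X(i)} \varphi$ and also
  $(\omega(M)(d), \omega(M)(d')), \beta \cup \beta' \models^{2\mathcal{D}}_{A(\Sigma(U)),X(i) \cup X(O)} \psi$.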


The last requirement that all transitions in a model are due to transition specifications 
does not cover the requirement of input enabledness for UML state machines: An event 
for which currently no transition can fire is discarded. This behaviour can be added by 
a syntactical transformation extending the set of transition specifications by self-loops 
with empty outputs for all situations where some event is not accepted. 
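The syntactical transformation for input enabledness can be sketched as follows. The example assumes a simplified representation that is not part of the formal account: a transition specification is a tuple (source, guard, input, outputs, target), guards are Python predicates over a data state and the event arguments, and postconditions are omitted. For every control state and input event, a self-loop with empty output is added whose guard is the negation of the disjunction of the existing guards.

```python
def complete(states, inputs, transitions):
    """Add discarding self-loops for all situations where an event is not accepted."""
    completed = list(transitions)
    for c in states:
        for i in inputs:
            guards = [g for src, g, ev, _outs, _tgt in transitions if src == c and ev == i]

            # Bind the current guard list as a default argument to avoid late binding.
            def discard(data, args, guards=guards):
                return not any(g(data, args) for g in guards)

            completed.append((c, discard, i, (), c))
    return completed


# Hypothetical machine with a single guarded transition.
specs = [("Idle", lambda data, args: data["ready"], "start", ("started",), "Busy")]
full = complete(["Idle", "Busy"], ["start", "stop"], specs)
```

In the formal setting the added self-loops would additionally carry a postcondition stating that all attributes keep their values; this frame condition is left out of the sketch.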

In UML, completion events are produced whenever a state completes its internal 
behaviour and such events have always to be prioritised in event processing; the reaction 
to a completion event is indicated by a transition without a triggering event. For the 
simple machines with output described here, where states do not show internal behaviour, 
the only use of completion events is to let a machine make progress autonomously 
without external input. For using this feature, the machine’s event/data signature has to 
be extended by such events and the transition specifications have to take completions 
into account. Still, the prioritisation cannot be covered by a single state machine alone, 
as it has no event processing discipline of its own. 

Extending the characterisation algorithm in [16] with outputs, it can be shown that
$\mathcal{M}^{\downarrow}_{\mathcal{D}}$ is expressive enough to capture the model class of a simple UML state machine
with output $U$ by a single sentence $\varrho_U$ such that $M \in \mathrm{Mod}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}(U)$ if, and only if,
$M \models^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\Sigma(U)} \varrho_U$. The simplest case is a single transition specification $(c, \varphi, i, O, \psi, c')$:
By requiring $(@c)\langle i : \varphi /\!\!/ O : \psi\rangle\!\rangle c'$ it can be ensured that a model indeed shows a
transition from control state $c$ to the control state $c'$ for the input event $i$ with precondition
$\varphi$ satisfied which outputs $O$ with $\psi$ satisfied. For requiring that such a transition for
input $i$ and output $O$ is only offered when the precondition $\varphi$ and the transition condition
$\psi$ hold, a formula $(@c)[i /\!\!/ O : \neg\varphi \lor \neg\psi]\mathit{false}$ has to be added. For ensuring that no
other output than $O$ can be produced, on the one hand $(@c)[i /\!\!/ O' : \mathit{true}]\mathit{false}$ for
every $O' \neq O$ that is at most the length of $O$ has to be added and on the other hand
$(@c)[i /\!\!/ [O'']_{O(\Sigma)} : \mathit{true}]\mathit{false}$ for every $O''$ with length one more than $O$.

Reasoning over a simple UML state machine with output $U$ in CASL via the translation
of $U$’s characterising sentence along the theoroidal comorphism of Thm. 2 will involve
some not fully transpicuous axioms due to the necessary exclusion of some behaviour using
formulæ like $(@c)[i /\!\!/ [O'']_{O(\Sigma)} : \mathit{true}]\mathit{false}$. It is therefore sometimes advantageous to
directly use the requirements for $M$ being a model of $U$ to obtain another characterisation
of the trans predicate in the CASL presentation for the comorphism, which then can be
favourably combined with Prop. 1 for proving invariants:


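Proposition 2. Let $U$ be a simple UML state machine with output and let $M' \in
\mathrm{Mod}^{\mathrm{CASL}}(\nu^{\mathrm{Sig}}(\Sigma(U)))$ such that $\nu^{\mathrm{Mod}}(M') \in \mathrm{Mod}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}(U)$. Then

$M' \models^{\mathrm{CASL}} \forall g : \mathrm{Conf}\,.\,\mathrm{reachable}(g) \Rightarrow$
  $(\forall g' : \mathrm{Conf};\ i_X : \mathrm{InEvt};\ O_X : \mathrm{List[OutEvt]}\,.\,\mathrm{trans}(g, i_X, O_X, g') \Leftrightarrow$
  $\bigvee_{(c,\varphi,i,O,\psi,c') \in T(U)} \exists X : s(I(\Sigma))(i);\ X' : s(O(\Sigma))(O)\,.$
  $c(g) = c \land \mathscr{F}^{\mathcal{D}}_{\Sigma,A(\Sigma)(g) \cup X}(\varphi) \land i_X = i(X) \land O_X = O(X') \land$
  $\mathscr{F}^{2\mathcal{D}}_{\Sigma,A(\Sigma)(g) \cup A(\Sigma)(g') \cup X \cup X'}(\psi) \land c(g') = c')$.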


5 Simple UML Composite Structures 


A UML composite structure [13, Ch. 11] specifies the internal structure of a class or 
component and its collaborations. For our purposes, a composite structure is given 
by class or component instances, its so-called parts, that can communicate through 
their attached ports specifying provided and required interfaces and being linked by 
connectors. All connectors are assumed to be binary and each part to be equipped with a 
state machine for describing its behaviour. 

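A composite structure signature $\Delta$ over $\mathcal{M}^{\downarrow}_{\mathcal{D}}$ consists of a set $\mathrm{Cmp}(\Delta)$ of parts
$c$ each equipped with an $\mathcal{M}^{\downarrow}_{\mathcal{D}}$-signature $\Sigma(\Delta, c)$ for its input and output events and
internal attributes; a set $\mathrm{Prt}(\Delta)$ of ports $p$ each showing a part $\mathrm{cmp}(\Delta)(p) \in \mathrm{Cmp}(\Delta)$
as well as an $\mathcal{M}^{\downarrow}_{\mathcal{D}}$-signature $\Sigma(\Delta, p)$ without attributes (i.e., $A(\Sigma(\Delta, p)) = \emptyset$) for
its provided (input) and required (output) events; and a symmetric binary relation
$\mathrm{Con}(\Delta) \subseteq \mathrm{Prt}(\Delta) \times \mathrm{Prt}(\Delta)$ of connectors such that

- for each part $c \in \mathrm{Cmp}(\Delta)$, the input and output events of $\Sigma(\Delta, c)$ are the provided
  and required events of $c$’s ports prefixed with the port name, i.e., for $F \in \{I, O\}$,
  $F(\Sigma(\Delta, c)) = \bigcup_{p \in \mathrm{cmp}(\Delta)^{-1}(c)} \{p.f \mid f \in F(\Sigma(\Delta, p))\}$;
- for each part $c \in \mathrm{Cmp}(\Delta)$, the attributes of $\Sigma(\Delta, c)$ are all prefixed with $c$, i.e., if
  $a \in A(\Sigma(\Delta, c))$, then $a = c.a_\star$;
- for each connection $(p, p') \in \mathrm{Con}(\Delta)$, the required events of port $p$ are provided by
  $p'$, i.e., $O(\Sigma(\Delta, p)) \subseteq I(\Sigma(\Delta, p'))$.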


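We say that port $p \in \mathrm{Prt}(\Delta)$ is open in $\Delta$ if there is no $p' \in \mathrm{Prt}(\Delta)$ such that
$(p, p') \in \mathrm{Con}(\Delta)$; otherwise $p$ is connected.

A composite structure signature morphism $\delta\colon \Delta \to \Delta'$ over $\mathcal{M}^{\downarrow}_{\mathcal{D}}$ consists of
a function $\mathrm{Cmp}(\delta)\colon \mathrm{Cmp}(\Delta) \to \mathrm{Cmp}(\Delta')$ mapping parts, together with an $\mathcal{M}^{\downarrow}_{\mathcal{D}}$-signature
morphism $\Sigma(\delta, c)\colon \Sigma(\Delta, c) \to \Sigma(\Delta', \mathrm{Cmp}(\delta)(c))$ for each $c \in \mathrm{Cmp}(\Delta)$; a
function $\mathrm{Prt}(\delta)\colon \mathrm{Prt}(\Delta) \to \mathrm{Prt}(\Delta')$ mapping ports, together with an $\mathcal{M}^{\downarrow}_{\mathcal{D}}$-signature
morphism $\Sigma(\delta, p)\colon \Sigma(\Delta, p) \to \Sigma(\Delta', \mathrm{Prt}(\delta)(p))$, preserving

- the part owning each port $p$, i.e., $\mathrm{Cmp}(\delta)(\mathrm{cmp}(\Delta)(p)) = \mathrm{cmp}(\Delta')(\mathrm{Prt}(\delta)(p))$;
- the connections, i.e., if $(p, p') \in \mathrm{Con}(\Delta)$, then $(\mathrm{Prt}(\delta)(p), \mathrm{Prt}(\delta)(p')) \in \mathrm{Con}(\Delta')$.

The category of $\mathrm{cs}(\mathcal{M}^{\downarrow}_{\mathcal{D}})$-signatures $\mathbb{S}^{\mathrm{cs}(\mathcal{M}^{\downarrow}_{\mathcal{D}})}$ consists of the composite structure
signatures and signature morphisms over $\mathcal{M}^{\downarrow}_{\mathcal{D}}$.

For a $\mathrm{cs}(\mathcal{M}^{\downarrow}_{\mathcal{D}})$-signature $\Delta$, a $\Delta$-composite structure structure (sic!) over $\mathcal{M}^{\downarrow}_{\mathcal{D}}$ is a
family $\mathfrak{E} = (\mathfrak{E}(c) \in |\mathrm{Str}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}(\Sigma(\Delta, c))|)_{c \in \mathrm{Cmp}(\Delta)}$ consisting of an $\mathcal{M}^{\downarrow}_{\mathcal{D}}$-structure for
each part $c$. The $\delta$-reduct $\mathfrak{E}'|\delta$ of a $\Delta'$-composite structure structure $\mathfrak{E}'$ over $\mathcal{M}^{\downarrow}_{\mathcal{D}}$ along
a composite structure signature morphism $\delta\colon \Delta \to \Delta'$ is computed component-wise as
$(\mathfrak{E}'(\mathrm{Cmp}(\delta)(c))|\Sigma(\delta, c))_{c \in \mathrm{Cmp}(\Delta)}$. The $\Delta$-composite structure structures form the
discrete category $\mathrm{Str}^{\mathrm{cs}(\mathcal{M}^{\downarrow}_{\mathcal{D}})}(\Delta)$ of $\mathrm{cs}(\mathcal{M}^{\downarrow}_{\mathcal{D}})$-structures over $\Delta$. For each signature
morphism $\delta\colon \Delta \to \Delta'$ in $\mathbb{S}^{\mathrm{cs}(\mathcal{M}^{\downarrow}_{\mathcal{D}})}$ the $\delta$-reduct functor $\mathrm{Str}^{\mathrm{cs}(\mathcal{M}^{\downarrow}_{\mathcal{D}})}(\delta)\colon \mathrm{Str}^{\mathrm{cs}(\mathcal{M}^{\downarrow}_{\mathcal{D}})}(\Delta') \to
\mathrm{Str}^{\mathrm{cs}(\mathcal{M}^{\downarrow}_{\mathcal{D}})}(\Delta)$ is given by $\mathrm{Str}^{\mathrm{cs}(\mathcal{M}^{\downarrow}_{\mathcal{D}})}(\delta)(\mathfrak{E}') = \mathfrak{E}'|\delta$.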

In UML, state machines organised in a composite structure communicate with each 
other by sending messages which are stored in event pools. A state machine draws a 
message from its event pool, which is typically implemented as an event queue, and 
reacts to this message by firing one of its enabled transitions or by discarding it when no 
transition is enabled. This communication scheme is obtained for a $\Delta$-composite structure
structure $\mathfrak{E}$ over $\mathcal{M}^{\downarrow}_{\mathcal{D}}$ by constructing an overall $\mathcal{M}^{\downarrow}_{\mathcal{D}}$-structure over an $\mathcal{M}^{\downarrow}_{\mathcal{D}}$-signature
that reflects the parts, the ports, and the connections in its events and attributes, but
includes explicit event queues as additional attributes. The overall $\mathcal{M}^{\downarrow}_{\mathcal{D}}$-structure over
this queue-based $\mathcal{M}^{\downarrow}_{\mathcal{D}}$-signature then implements the selection of an event from a part’s
event queue, the reactions of this part to this event, and the distribution of the produced
messages to the connected parts.

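Formally, we construct a functor $\Sigma_q\colon \mathbb{S}^{\mathrm{cs}(\mathcal{M}^{\downarrow}_{\mathcal{D}})} \to \mathbb{S}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}$ on signatures that assigns
to a composite structure signature $\Delta$ the queue-based event/data signature $\Sigma_q(\Delta) =
\bigcup_{c \in \mathrm{Cmp}(\Delta)} (\Sigma(\Delta, c) \cup \{q_c : \hat{I}(\Sigma(\Delta, c))^*\})$ and to a composite structure signature
morphism the canonically corresponding event/data signature morphism. For a composite
structure signature $\Delta$ and a part $c \in \mathrm{Cmp}(\Delta)$ there is a natural signature embedding
$\eta_{\Delta,c}\colon \Sigma(\Delta, c) \to \Sigma_q(\Delta)$.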

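For a $\Delta$-composite structure structure $\mathfrak{E}$ we construct an overall $\Sigma_q(\Delta)$-event/data
structure $M_{\mathfrak{E}}$ as follows: An overall configuration of $M_{\mathfrak{E}}$ consists, for each part
$c \in \mathrm{Cmp}(\Delta)$, of an event queue $q(c) \in \hat{I}(\Sigma(\Delta, c))^*$ stored in the attribute $q_c$ and a part
configuration $\gamma(c) \in \Gamma(\mathfrak{E}(c))$; initially, all parts are in some of their initial configurations
and all event queues are empty. In an overall configuration $(q(c), \gamma(c))_{c \in \mathrm{Cmp}(\Delta)}$ an
overall transition to another overall configuration $(q'(c), \gamma'(c))_{c \in \mathrm{Cmp}(\Delta)}$ reacts to
some $\hat{\imath} \in \hat{I}(\Sigma_q(\Delta))$ and outputs some $\hat{O} \in \hat{O}(\Sigma_q(\Delta))^*$. This $\hat{\imath}$ can either instantiate
some provided event $i \in I(\Sigma(\Delta, p_\star))$ of some of the open ports $p_\star \in \mathrm{Prt}(\Delta)$ with
$c_\star = \mathrm{cmp}(\Delta)(p_\star)$, or it is the head of the event queue of some $c_\star \in \mathrm{Cmp}(\Delta)$ such
that $\hat{\imath} \in \hat{I}(\Sigma(\Delta, c_\star))$. In the latter case, $\hat{\imath}$ is removed from the event queue of $c_\star$. In
both cases, the reaction of part $c_\star$ is any transition $(\gamma(c_\star), \gamma'_\star) \in R(\mathfrak{E}(c_\star))_{\hat{\imath},\hat{O}}$ and
overall $\gamma' = \gamma\{c_\star \mapsto \gamma'_\star\}$. Finally, all outputs $p.\hat{o} \in \hat{O}$ such that $(p, p') \in \mathrm{Con}(\Delta)$ and
$\mathrm{cmp}(\Delta)(p') = c'$ are appended to the respective event queue of part $c'$. This defines
a natural transformation $\mathrm{Str}_q\colon \mathrm{Str}^{\mathrm{cs}(\mathcal{M}^{\downarrow}_{\mathcal{D}})} \to \Sigma_q; \mathrm{Str}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}$ with $\mathrm{Str}_{q,\Delta}(\mathfrak{E}) = M_{\mathfrak{E}}$.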
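The queue-based communication scheme can be sketched operationally. The following Python fragment is a simplification under stated assumptions: parts react deterministically through a function instead of a transition relation, message arguments are omitted, a message is a pair (port, event), and all names (`step`, `owner`, `con`, the event `verify`) are hypothetical. One overall step lets a part react either to an external message on one of its open ports or to the head of its event queue, and then appends the outputs sent over connected ports to the queue of the part owning the peer port.

```python
from collections import deque


def step(confs, queues, owner, con, react, part, external=None):
    """One overall transition in which `part` reacts to a single message."""
    confs = dict(confs)
    queues = {c: deque(q) for c, q in queues.items()}
    if external is not None:
        port, _event = external
        if port in con or owner[port] != part:
            raise ValueError("external input needs an open port of the reacting part")
        message = external
    else:
        if not queues[part]:
            raise ValueError("event queue of the reacting part is empty")
        message = queues[part].popleft()      # remove the head of the queue
    confs[part], outputs = react[part](confs[part], message)
    for port, event in outputs:
        if port in con:                       # deliver over the connector
            peer = con[port]
            queues[owner[peer]].append((peer, event))
    return confs, queues, outputs


owner = {"userCom": "atm", "bankCom": "atm", "atmCom": "bank"}
con = {"bankCom": "atmCom", "atmCom": "bankCom"}   # symmetric connector relation


def atm(conf, message):
    if conf == "CardEntered" and message == ("userCom", "PIN"):
        return "Verifying", [("bankCom", "verify")]
    return conf, []                           # message is discarded


def bank(conf, message):
    if conf == "Idle" and message == ("atmCom", "verify"):
        return "Verifying", []
    return conf, []


react = {"atm": atm, "bank": bank}
start_confs = {"atm": "CardEntered", "bank": "Idle"}
start_queues = {"atm": deque(), "bank": deque()}

confs1, queues1, _ = step(start_confs, start_queues, owner, con, react, "atm",
                          external=("userCom", "PIN"))
confs2, queues2, _ = step(confs1, queues1, owner, con, react, "bank")
```

The function copies its arguments, so each call yields a new overall configuration; a nondeterministic part would return a set of possible reactions instead of a single one.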


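Theorem 3. $(\mathbb{S}^{\mathrm{cs}(\mathcal{M}^{\downarrow}_{\mathcal{D}})}, \mathrm{Str}^{\mathrm{cs}(\mathcal{M}^{\downarrow}_{\mathcal{D}})}, \mathrm{Sen}^{\mathrm{cs}(\mathcal{M}^{\downarrow}_{\mathcal{D}})}, \models^{\mathrm{cs}(\mathcal{M}^{\downarrow}_{\mathcal{D}})})$ with $\mathrm{Sen}^{\mathrm{cs}(\mathcal{M}^{\downarrow}_{\mathcal{D}})} = \Sigma_q; \mathrm{Sen}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}$
and $\mathfrak{E} \models^{\mathrm{cs}(\mathcal{M}^{\downarrow}_{\mathcal{D}})}_{\Delta} \varrho$ if, and only if, $\mathrm{Str}_{q,\Delta}(\mathfrak{E}) \models^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}_{\Sigma_q(\Delta)} \varrho$ is an institution.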


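$\mathrm{cs}(\mathcal{M}^{\downarrow}_{\mathcal{D}})$ inherits the event/data formulæ of $\mathcal{M}^{\downarrow}_{\mathcal{D}}$ and the underlying $\mathcal{D}$, though
extended by queue attributes. In particular, we have for a part $c \in \mathrm{Cmp}(\Delta)$ that a
transition sentence $\langle i : \varphi /\!\!/ O : \psi\rangle\varrho$ (in the current configuration there are valuations
and a transition for the incoming message and the outgoing messages such that these
valuations satisfy transition formula $\psi$ and $\varrho$ holds afterwards) locally formulated for
this part can be faithfully transferred to the global composite structure, abbreviating the
embedding $\eta_{\Delta,c}$ to $\eta$,

$\langle I(\eta)(i) : \mathscr{F}^{\mathcal{D}}_{A(\eta),X(i)}(\varphi) \land (\mathrm{hd}(q_c) = I(\eta)(i) \lor \mathrm{open}_\Delta(I(\eta)(i))) /\!\!/$
$O(\eta)(O) : \mathscr{F}^{2\mathcal{D}}_{A(\eta),X(i) \cup X(O)}(\psi) \land$
$\bigwedge_{a \in A(\Sigma_q(\Delta)) \setminus (A(\Sigma(\Delta,c)) \cup \{q_c \mid c \in \mathrm{Cmp}(\Delta)\})} a = a' \land$
$\mathrm{dist}_\Delta(I(\eta)(i), O(\eta)(O), (q_c, q'_c)_{c \in \mathrm{Cmp}(\Delta)})\rangle\,\mathrm{Sen}^{\mathcal{M}^{\downarrow}_{\mathcal{D}}}(\eta)(\varrho)$,

where hd yields the head of a queue, open checks whether the part’s port for the event
is open, the frame condition $a = a'$ ranges over all attributes not pertaining to $c$ or the
queues, and dist removes the input and distributes the outputs to the queues.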


6 Verification Example: Communication between User, ATM and 
Bank 


We applied⁴ the technique set out in this paper to the example from the introduction
concerning a typical interaction between a User, an ATM component and a Bank
component.

We formalised the state machines for the Bank and the ATM as well as their communication
in CASL. We then set out to show a safety property (by means of a stronger invariant)
on this system by inductive verification, as justified by Prop. 1. We first tried to show
the preservation of said invariant using fully automatic provers connected to Hets [10],
the main tool suite for verification based on CASL and institution theory. However, no
inductive automated provers are currently connected to Hets. Therefore, handling freely
generated datatypes would require manual intervention to add suitable induction schemes,
defeating our goal of automation. Instead we utilised the interactive theorem prover
KIV [2]. This prover supports algebraic specifications similar to CASL and offers extensive
heuristics for inductive proofs. KIV’s heuristics fully automatically discharged all proof
obligations in our experiments. The translation of the CASL specifications into KIV is
straightforward.

⁴ Full specifications and proofs accessible at: https://rosento.github.io/2021-paper-composite/

With our process clarified, we can now state the safety property we will prove: 


safe-def: safe(g) ↔ (ctrl(caConf(g)) = Verified → wasVerified(cbConf(g)) = 1);
used for: s, ls;


The above introduces an axiom safe-def defining the predicate safe and marks the 
axiom for use as a simplifier rule (s) and a local simplifier rule (1s) for the KIV system. 

The predicate safe ranges over a type of system configurations, each consisting of the
ATM configuration (caConf) and queue, as well as the bank configuration (cbConf) and
queue. The machine configurations in turn consist of the control state and attributes. The
safety predicate holds in a configuration iff, whenever the ATM is in control state Verified,
the bank attribute wasVerified has the value 1.

The behaviours of Bank and ATM are defined in the form of an initial state predicate 
and a transition predicate. For space reasons we show only one transition: 
atmTrans-def: atmTrans(atmConf(sa1, c1, p1, t1), in, out, atmConf(sa2, c2, p2, t2))
    ↔ ∃ c : CardId, p : Pin. ...
      ∨ (sa1 = CardEntered
         ∧ in = msg(userCom, PIN(p)) ∧ out = (msg(atmCompl, PINEnteredCompl) +1 [])
         ∧ p2 = p ∧ sa2 = PINEntered ∧ c2 = c1 ∧ t2 = t1)
      ∨ ...; used for: s, ls;


The ATM transitions from one configuration to another, receiving an input event and 
sending out a list of messages. Each ATM configuration consists of (in that order) the 
control state, the card id to be verified, the PIN to be verified and the counter for the 
number of verification attempts. We give the definition of the transition predicate by a 
disjunction of the conditions of all syntactic transitions, including the control state before, 
the input event, the output list, variables to be set, the control state after and variables 
to remain unchanged. Given these machine predicates and a predicate dist to encode 
connectors, we can then define the transition predicate for the overall system: 
trans-def: trans(conf(ca1, qa1, cb1, qb1), in, out, conf(ca2, qa2, cb2, qb2))
    ↔ dist(out, qa1, qa2, qb1, qb2)
      ∧ ( (atmTrans(ca1, in, out, ca2) ∧ cb2 = cb1)
        ∨ (bankTrans(cb1, in, out, cb2) ∧ ca2 = ca1)); used for: s, ls;


Initially, the queues are empty and the machines are in their initial configurations. 

Having thus defined the machines, we turn to verification and define an invariant
strong enough to show both its own preservation and our safety property. The idea is to
keep track of the queue states that allow the ATM to enter the Verified state or the
wasVerified attribute to be reset. In essence the invariant can be syntactically read off
from the composite structure.


invar-def: invar(conf(ca, qa, cb, qb)) ↔ ∃ x.
      (ctrl(ca) = Idle ∧ ctrl(cb) = Idle ∧ qa = empty ∧ qb = empty)
    ∨ (ctrl(ca) = CardEntered ∧ ctrl(cb) = Idle ∧ qa = empty ∧ qb = empty)
    ∨ (ctrl(ca) = PINEntered ∧ ctrl(cb) = Idle ∧ qa = enq(x, empty) ∧ qb = empty)
    ∨ (ctrl(ca) = Verifying ∧ ctrl(cb) = Idle ∧ qa = empty ∧ qb = enq(x, empty))
    ∨ (ctrl(ca) = Verifying ∧ ctrl(cb) = Verifying ∧ qa = empty ∧ qb = enq(x, empty))
    ∨ (ctrl(ca) = Verifying ∧ ctrl(cb) = VeriSuccess ∧
       qa = empty ∧ qb = enq(x, empty) ∧ wasVerified(cb) = 1)
    ∨ (ctrl(ca) = Verifying ∧ ctrl(cb) = VeriFail ∧ qa = empty ∧ qb = enq(x, empty))
    ∨ (ctrl(ca) = Verifying ∧ ctrl(cb) = Idle ∧
       qa = enq(msg(bankCom, reenterPIN), empty) ∧ qb = empty)
    ∨ (ctrl(ca) = Verifying ∧ ctrl(cb) = Idle ∧
       qa = enq(msg(bankCom, verified), empty) ∧ qb = empty ∧ wasVerified(cb) = 1)
    ∨ (ctrl(ca) = Verified ∧ ctrl(cb) = Idle ∧
       qa = enq(x, empty) ∧ qb = empty ∧ wasVerified(cb) = 1); used for: s, ls;


Note that we can mostly ignore attribute values, as well as all distinctions between 
queue elements unrelated to our verification task. We can then formulate lemmas to the 
effect that this invariant does in fact imply the safety property, that it is satisfied in any 
legal initial configurations and that it is preserved by all transitions. These lemmas are as 
follows, again limited to one example for the transitions: 


Safe: invar(g) → safe(g);
Init: init(g) → invar(g);

Trans6: g1 = conf(atmConf(Verifying, c, p, t), qa, cb, qb)
      ∧ qa ≠ empty ∧ top(qa) = msg(atmCom, verified)
      ∧ g2 = conf(atmConf(Verified, c, p, t),
                  enq(msg(atmCompl, VerifiedCompl), deq(qa)), cb, qb)
      ∧ invar(g1) → invar(g2);


Formulating separate lemmas for each transition instead of one lemma using the 
transition predicate helps us avoid a combinatorial explosion in the theorem prover. 
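The lemma Safe can be sanity-checked by brute force on a finite abstraction. The following sketch is not the KIV proof: it assumes that card and PIN attributes are dropped, that queues are tuples of at most two abstract messages named only by their payload, and that wasVerified ranges over 0 and 1. The predicates `invar` and `safe` transcribe the disjuncts of invar-def and the definition safe-def under this abstraction.

```python
from itertools import product

ATM = ["Idle", "CardEntered", "PINEntered", "Verifying", "Verified"]
BANK = ["Idle", "Verifying", "VeriSuccess", "VeriFail"]
MSGS = ["reenterPIN", "verified", "other"]           # abstract queue elements
QUEUES = [()] + [(m,) for m in MSGS] + [(m, n) for m in MSGS for n in MSGS]


def safe(ca, qa, cb, qb, was_verified):
    return ca != "Verified" or was_verified == 1


def invar(ca, qa, cb, qb, was_verified):
    def one(q):                                      # q = enq(x, empty) for some x
        return len(q) == 1
    return any([
        ca == "Idle" and cb == "Idle" and qa == () and qb == (),
        ca == "CardEntered" and cb == "Idle" and qa == () and qb == (),
        ca == "PINEntered" and cb == "Idle" and one(qa) and qb == (),
        ca == "Verifying" and cb == "Idle" and qa == () and one(qb),
        ca == "Verifying" and cb == "Verifying" and qa == () and one(qb),
        ca == "Verifying" and cb == "VeriSuccess" and qa == () and one(qb)
        and was_verified == 1,
        ca == "Verifying" and cb == "VeriFail" and qa == () and one(qb),
        ca == "Verifying" and cb == "Idle" and qa == ("reenterPIN",) and qb == (),
        ca == "Verifying" and cb == "Idle" and qa == ("verified",) and qb == ()
        and was_verified == 1,
        ca == "Verified" and cb == "Idle" and one(qa) and qb == () and was_verified == 1,
    ])


CONFS = list(product(ATM, QUEUES, BANK, QUEUES, [0, 1]))
safe_lemma_holds = all(safe(*g) for g in CONFS if invar(*g))
```

Only the last disjunct of the invariant mentions the control state Verified, and it fixes wasVerified to 1, which is why the implication holds; the enumeration merely confirms this on the abstraction.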

Providing our specification to KIV with all definitions marked as simplifier rules and
activating the heuristics mode “PL heuristics + structural induction”, each of our lemmas
is proved without noticeable delay, i.e., the verification of the invariant is successful and
does not pose any difficulty to the prover.


7 Conclusion 


We have developed two new institutions extending the hybrid modal logic $\mathcal{M}^{\downarrow}_{\mathcal{D}}$ [16].
One institution caters for simple UML state machines with outputs; an extension of it
captures simple UML composite structure diagrams. Besides providing formal semantics
for communicating UML state machines, these institutions provide, via comorphisms, a
bridge towards theorem proving for UML. Through an elementary example we could
demonstrate that, thanks to our framework, effective automated theorem proving for
communicating UML state machines is possible.

Future work will be on proof automation. In particular we plan to implement the
translations from UML into extended $\mathcal{M}^{\downarrow}_{\mathcal{D}}$, the institution comorphisms from extended
$\mathcal{M}^{\downarrow}_{\mathcal{D}}$ to CASL, and possibly the link from Hets to KIV. Yet another important aspect is
to implement analyses of the composite structure and its state machines with a view to
automatically generating lemmas for automated theorem proving. In terms of our general
research programme, the next topics to tackle are UML interactions and how they relate to
or refine UML state machines. Going beyond the UML, it would be interesting to consider
a truly heterogeneous framework, in which composite structure diagrams connect not
only UML state machines, but also components specified in languages such as TLA or
Event-B.

Abstract. Nowadays, software development is accelerated through the
reuse of code snippets found online in question-answering platforms and 
software repositories. In order to be efficient, this process requires form- 
ing an appropriate query and identifying the most suitable code snippet, 
which can sometimes be challenging and particularly time-consuming. 
Over the last years, several code recommendation systems have been de- 
veloped to offer a solution to this problem. Nevertheless, most of them 
recommend API calls or sequences instead of reusable code snippets. Fur- 
thermore, they do not employ architectures advanced enough to exploit 
the semantics of natural language and code in order to form the optimal 
query from the question posed. To overcome these issues, we propose 
CodeTransformer, a code recommendation system that provides useful, 
reusable code snippets extracted from open-source GitHub repositories. 
By employing a neural network architecture that comprises advanced 
attention mechanisms, our system effectively understands and models 
natural language queries and code snippets in a joint vector space. Upon 
evaluating CodeTransformer quantitatively against a similar system and 
qualitatively using a dataset from Stack Overflow, we conclude that our 
approach can recommend useful and reusable snippets to developers. 


Keywords: code reuse - semantic analysis - neural transformers. 


1 Introduction 


The wide uptake of open-source software in the last few decades has accelerated 
software development through code reuse. Nowadays, developers search online 
for ways to solve issues that arise during the development process, such as writing 
code for complex tasks, integrating APIs, or fixing bugs. The popularity of this 
paradigm has been further boosted by the introduction of online repositories
(e.g. GitHub) and programming communities (e.g. Stack Overflow). 

As code reuse has become a vital aspect of today’s software development, 
the challenge of finding appropriate answers to programming-related questions 
in the vastness of the Internet led to the development of code recommendation 
systems. While the majority focus on providing API calls and sequences (e.g.
DeepAPI [10]), a select few have the advantage of recommending reusable code
snippets (e.g. DeepCS [9]). Such systems that employ whole snippet extraction 
mechanisms are greatly valued, as they significantly reduce development time. 

However, they are also prone to important limitations. Many accept queries 
in specialized query languages instead of natural language. In addition, most 
systems do not employ mechanisms advanced enough to extract the semantics 
found both in the queries and the source code. And even though some systems 
engage in semantic analysis (e.g. DeepCS [9], CodeSearchNet [12]), crucial in- 
formation, such as the control flow of a code snippet, is discarded. Finally, the 
aforementioned systems typically employ non-annotated datasets and, by exten- 
sion, lack in terms of training and quantitative evaluation, as ground truth data 
are essential for the training of a system and the assessment of its performance. 

Acknowledging the need for advancing code reuse, GitHub initiated the Code- 
SearchNet challenge [12], a public competition for code search, specifically aiming 
to improve on four baseline models using an annotated dataset. These models 
receive queries in natural language and employ different neural network architec- 
tures to return high-quality code snippets. The CodeSearchNet challenge overall 
provides an interesting testbed due to the variety of programming languages and 
code snippets in the dataset and the evaluation tools offered. 

Influenced by this challenge, in this paper we present CodeTransformer, 
a system that receives natural language queries and provides reusable code snip- 
pets. CodeTransformer uses state-of-the-art neural network and language un- 
derstanding techniques, while it also employs a custom similarity metric and 
a custom loss function. Our system does not require a specialized query 
language; instead, it receives queries in natural language and employs neural 
machine translation to offer reusable snippets in the form of methods. We train 
our system on a state-of-the-practice annotated dataset and evaluate its effec- 
tiveness against the baseline CodeSearchNet systems [12]. Finally, we assess its 
applicability in a question-answering context using data from Stack Overflow. 


2 Related Work 


Code search systems can be distinguished into two categories, those producing 
sequences of API calls and those producing reusable code. The first category in- 
cludes systems such as SWIM [21] and T2API [19], which translate text queries 
to API calls and then synthesize their usage code, i.e. code that uses the calls. 
SWIM extracts API calls related to a query using Bing and forms their usage 
code, including the control flow. A limitation is that it cannot handle the seman- 
tics of queries (e.g. “convert int to string” and “convert string to int”). T2API 
is trained on Stack Overflow posts and uses the GraLan language model [17] to 
model dependencies between API calls and synthesize their usage code. 

A different approach to API call recommendation is taken by MULAPI [24]. 
Apart from usage examples, MULAPI also analyzes the source code and API 
libraries of a project to provide an implementation of the requested feature. The 
system also maps the repository of the code to recommend files as locations for 
the provided API usage code. The architecture of MULAPI comprises a Stanford 
Word Segmenter for text preprocessing and a Vector Space Model to assess the 
similarity between texts. FOCUS [18] is a similar system that analyzes a project’s 
repository and other open source repositories using Abstract Syntax Trees and 
assesses their similarity using Context-Aware Correlation Filter. Next, it mines 
API calls from the most similar repositories and presents them to the developers. 


Other systems treat code recommendation as a machine translation problem. 
One of them is DeepAPI [10], which utilizes a Neural Network architecture to 
transform natural language queries to API sequences. It consists of a recurrent 
neural network (RNN) encoder that processes natural language using attention 
mechanisms and an RNN decoder using an Inverse Document Frequency (IDF)- 
based weighting mechanism to output API sequences. BIKER [4] is a similar 
system that receives natural language queries and assesses their similarity to 
Stack Overflow question posts and API documentation. Post texts and code 
snippets are handled as text and are used to train an embedding model that 
takes into account IDF weights, and recommends relevant API calls. 


Word2API [15] also bridges the semantic gap between natural language and 
code to provide API recommendations. The system creates tuples of method de- 
scriptions and API sequences that are used to train a word embedding model for 
vector generation. A more advanced approach was implemented by DeepAPIRec 
[6]. Its architecture consists of Tree-LSTMs, a long short-term memory (LSTM) 
unit variant that organizes information in an inverse tree structure. DeepAPIRec 
also utilizes a statistical parameter model of data dependency that allows rec- 
ommending parameter values for the APIs suggested by the Tree-LSTM. 


The second category of systems comprises the ones that recommend reusable 
code snippets instead of API calls. One of them is Seahawk [20], an Eclipse plu- 
gin that, given a query, returns a ranked list of relevant Stack Overflow posts. 
The posts are retrieved using Apache Solr and ranked using tf-idf. The snippets 
found in the posts can be integrated into the code of a project. Like Seahawk, 
NLP2Code [5] is an Eclipse plugin that retrieves code snippets from Stack Over- 
flow posts. NLP2Code processes natural language text and snippets using the 
TaskNav algorithm and measures their grammatical correlation with the Stan- 
ford CoreNLP Toolkit. The system receives natural language queries and employs 
a customized version of Google Search Engine for search. StackSearch [8] also 
extracts information from Stack Overflow posts and recommends code snippets 
using a hybrid language model that combines tf-idf and fastText [3]. Its results 
are also accompanied with labels extracted using named entity recognition. 


An interesting alternative is DeepCS [9], which recommends reusable code 
snippets given a natural language query. DeepCS employs two RNN encoders, 
one that receives natural language descriptions of methods and one that receives 
a fusion of method names, API sequences and code tokens. Then the system max 
pools the embeddings generated by the two encoders and assesses their similar- 
ity using cosine similarity. DeepCS can understand the semantics of natural 
language and code to a certain extent; however, it relies on the generated vectors 
to rank its results without considering more code features such as context. 

In contrast to systems that utilize raw data dumps from Stack Overflow or 
code repositories, CodeSearchNet [12] introduced a well curated dataset specif- 
ically designed for semantic code search, as it consists of docstring and code 
tokens which highlight their semantics while also facilitating the preprocessing. 
Moreover, it introduced four different baselines, each using a different architec- 
ture for its encoders (Neural Bag-of-Words, Bidirectional RNNs, 1D Convolu- 
tional Neural Networks and Self-Attention). CodeSearchNet outperforms most 
systems due to the quality of its dataset and its powerful neural architectures. 
However, it ignores certain semantics, such as the control flow of the code, so it 
favors keyword-based methods instead of those using semantic information. 

Although the aforementioned systems are effective in certain scenarios, they 
have important limitations. Most of them handle natural language input as key- 
words, i.e. measuring token frequency instead of analyzing semantics and con- 
text. Also, most systems output API calls or API usage code instead of reusable 
snippets. Deep learning systems often do not employ custom similarity metrics 
and loss functions. CodeTransformer is trained on high-quality annotated 
data from the CodeSearchNet corpus. It analyzes the query and code semantics 
using word embeddings, generated with state-of-the-art attention mechanisms. 
We employ a hybrid similarity metric and build a custom loss function that are 
suited to the challenge at hand. Thus, our system is able to comprehend relations 
between similar queries (e.g. “how to write to command line” and “how to out- 
put to terminal”) and distinguish queries with lexically minor, yet semantically 
major differences (e.g. “convert int to string” and “convert string to int”). 


3 Semantic Code Search using Machine Translation 


The architecture of our system, shown in Figure 1, comprises four modules: the 
Dataset Builder, the Neural Network, the Index Builder, and the Search Engine. 
The Dataset Builder preprocesses the natural language and code data to produce 
a clean dataset, including the vocabularies of the input and target languages. 
The Neural Network module generates word embeddings and extracts the most 
important features per language using attention mechanisms. 


Fig. 1. The architecture of CodeTransformer 

Max pooling is used on the word embeddings to generate a single embedding 
for each natural language and code sequence. The Index Builder builds a vector 
space containing the sequence embeddings. Each code vector is assigned to an 
index to allow nearest neighbor search when a natural language vector is received. 
The Search Engine receives an input query in the GUI and forwards it to the 
Computations submodule, where the Neural Network analyzes it and generates 
a natural language sequence embedding. This vector representation of the query 
is inserted in the vector space to search for its nearest code vectors. The results 
are forwarded back to the GUI and presented to the user. These modules are 
further analyzed in the following subsections. 


3.1 Data Preprocessor 


Dataset Overview The CodeSearchNet corpus comprises over 6.4 million code 
snippets written in 6 languages, with over 2.3 million of them annotated using 
docstrings [12]. The snippets were extracted from GitHub repositories, and fil- 
tered to remove test functions/constructors, trim long docstrings, and apply 
de-duplication [16,1]. CodeTransformer was implemented using the Java dataset 
of the corpus that contains over 1.5 million snippets, of which over 0.54 million 
come with docstrings. Although we use Java as a proof of concept, it is impor- 
tant to note that our system is mostly language agnostic. Our methodology can 
be applied to other languages, e.g. Python or JavaScript, with minimal changes. 

For each snippet, the dataset contains fields about its origin (repo, path, url, 
sha) and fields concerning its data (original/full string, method name, extracted 
code and docstring). The code and the documentation of the snippet (docstring) 
are also provided as tokens. Table 1 depicts a sample entry of the dataset. 


Table 1. An example entry of the dataset 


Features    Data 

func_name   JsonObjectDeserializer.getRequiredNode 

docstring   /** 
             * Helper method to return a {@link JsonNode} from the tree. 
             * @param tree the source tree 
             * @param fieldName the field name to extract 
             * @return the {@link JsonNode} 
             */ 

code        protected JsonNode getReqNode(JsonNode tree, String fieldName) { 
                Assert.notNull(tree, "Tree must not be null"); 
                JsonNode node = tree.get(fieldName); 
                Assert.state(node != null && !(node instanceof NullNode), () -> 
                    "Missing JSON field '" + fieldName + "'"); 
                return node; 
            } 


After manual inspection, we concluded that the majority of the dataset entries 
contain valid natural language docstrings, extracted from each function. 
However, in certain entries the snippets are not properly annotated and in oth- 
ers the automated natural language text extractor has failed to extract the doc- 
string correctly. For instance, in the docstring of Table 1, the extracted docstring 
tokens are [‘helper’, ‘method’, ‘to’, ‘return’, ‘a’, ‘{’]. To avoid having docstrings 
that are incorrect or are not properly tokenized, we first preprocess the dataset. 


Data Preprocessing We create two separate preprocessing pipelines to effec- 
tively target the docstrings and the code data. The regular expressions of Table 
2 enable modifications in the tokens of the dataset. 


Table 2. Regular expressions for preprocessing 


Regex Name           Regular Expression 

remove_non_ascii     [^\x00-\x7f] 
remove_special       [^A-Za-z0-9]+ 
separate_strings     [A-Z][a-z][^A-Z]* 
fill_empty           [A-Z][a-z][^A-Z]*|[A-Z]*(?![a-z])|[A-Z][a-z][^A-Z]* 
remove_unnecessary   (\s)|(")|(//)|(/\*)|(/\*\*) 
replace_symbols      [()\[\]{}<>+\-*/^%=&|!?@\.,:;] 


For the removal of noisy natural language data, we designed a pipeline of 
preprocessing steps, as described below: 


1. We remove all the tokens of the docstring located after the first dot symbol 
encounter, thus reducing their size to that of typical natural language queries. 

2. The remove_non_ascii and remove_special expressions are used to replace all 
non-ASCII characters and all special characters, respectively, in the tokens 
of the docstring list with empty characters. 

3. The separate_strings expression is used to separate all the camelCase tokens 
of the docstring list and thus augment the data for the neural network. 

4. We empty all docstring lists that contain fewer than 6 or more than 30 tokens¹, 
as inefficient or lengthy, respectively. The lists are filled with the corresponding 
camelCase function names, separated using the fill_empty expression. 

5. All uppercase characters in the docstring tokens are converted to the cor- 
responding lowercase characters, to achieve structural uniformity between 
tokens with the same meaning but different writing format. 


As an example, the docstring of the snippet shown in Table 1 produces the 
tokens [‘helper’, ‘method’, ‘to’, ‘return’, ‘json’, ‘node’, ‘from’, ‘the’, ‘tree’]. 
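
The docstring steps above can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: the names `split_camel` and `preprocess_docstring` are hypothetical, whitespace tokenization is assumed, and the camelCase split uses a simplified lookaround expression instead of the separate_strings and fill_empty expressions of Table 2, so its output may differ in detail from the tokens reported above.

```python
import re

REMOVE_NON_ASCII = re.compile(r"[^\x00-\x7f]")
REMOVE_SPECIAL = re.compile(r"[^A-Za-z0-9]+")
# Assumption: a simplified camelCase boundary, not the expression of Table 2.
CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])")


def split_camel(token):
    """Split a camelCase token into its parts (case is preserved)."""
    return CAMEL_BOUNDARY.sub(" ", token).split()


def preprocess_docstring(docstring, func_name, min_len=6, max_len=30):
    # Step 1: keep only the text before the first dot.
    first_sentence = docstring.split(".", 1)[0]
    tokens = []
    for raw in first_sentence.split():
        # Step 2: drop non-ASCII and special characters.
        cleaned = REMOVE_SPECIAL.sub("", REMOVE_NON_ASCII.sub("", raw))
        # Step 3: separate camelCase tokens (empty strings yield no tokens).
        tokens.extend(split_camel(cleaned))
    # Step 4: replace too short or too long docstrings with the split function name.
    if not min_len <= len(tokens) <= max_len:
        tokens = split_camel(func_name.split(".")[-1])
    # Step 5: lowercase all tokens.
    return [token.lower() for token in tokens]
```

For a docstring with fewer than 6 tokens, such as "Returns node.", the sketch falls back to the function name and returns the parts of `getRequiredNode`.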


1 The limits were defined after studying the data and concluding that most entries 
with inefficient docstrings contained fewer than 6 docstring tokens, while also noting 
that 30 tokens are adequate for a well-defined description of a function. 

Concerning noisy code data, we designed a preprocessing pipeline that slightly 
differs from those of other systems. Most systems do not sufficiently exploit the 
control flow information of a code snippet. Instead, they solely focus on function 
and variable names, as well as control flow words, such as if, else, for, etc. To fully 
exploit the programming symbols of snippets, we perform the following steps: 


1. The remove_non_ascii and separate_strings expressions are used to remove 
all non-ASCII characters and split the text into tokens. 

2. We remove all the tokens of the code list that contain a space or double quotes, 
or that open a comment, using the remove_unnecessary expression. 

3. We encode programming symbols to unique tokens, as shown in Table 3. 


Table 3. The encoding of programming symbols to unique tokens 


#    Unique Token        #    Unique Token        #    Unique Token 

(    openingparen        *    multiplyoperator    <    lessoperator 
)    closingparen        /    divideoperator      >=   greaterequaloperator 
[    openingbracket      ^    poweroperator       <=   lessequaloperator 
]    closingbracket      %    modulooperator      ++   incrementoperator 
{    openingbrace        =    assignoperator      --   decrementoperator 
}    closingbrace        ==   equaloperator       !    notoperator 
+    addoperator         !=   notequaloperator    @    atsign 
-    subtractoperator    >    greateroperator     ;    semicolon 


4. The remove_special regular expression is used to replace all the non-alphanumeric 
characters in the tokens of the code list with empty strings. This 
step removes symbols that were not replaced in the previous step. 

5. We limit the length of the code lists to their first 100 tokens, trimming methods 
of great length and thus enhancing the uniformity of the dataset. Also, all 
uppercase characters in the code tokens are converted to the corresponding 
lowercase ones, as in the docstrings, to favor structural uniformity. 


As an example, the code of the method snippet shown in Table 1 produces 
the tokens shown in Figure 2. 


‘protected’, ‘json’, ‘node’, ‘get’, ‘required’, ‘node’, ‘openingparen’, ‘json’, ‘node’, ‘tree’, ‘string’, 
‘field’, ‘name’, ‘closingparen’, ‘openingbrace’, ‘assert’, ‘not’, ‘null’, ‘openingparen’, ‘tree’, ‘tree’, 
‘must’, ‘not’, ‘be’, ‘null’, ‘closingparen’, ‘semicolon’, ‘json’, ‘node’, ‘node’, ‘assignoperator’, 
‘tree’, ‘get’, ‘openingparen’, ‘field’, ‘name’, ‘closingparen’, ‘semicolon’, ‘assert’, ‘state’, 
‘openingparen’, ‘notequaloperator’, ‘null’, ‘notoperator’, ‘openingparen’, ‘node’, ‘instanceof’, ‘null’, 
‘node’, ‘closingparen’, ‘openingparen’, ‘closingparen’, ‘missing’, ‘json’, ‘field’, ‘addoperator’, 
‘field’, ‘name’, ‘addoperator’, ‘closingparen’, ‘semicolon’, ‘return’, ‘node’, ‘semicolon’, ‘closingbrace’ 


Fig. 2. Example tokens extracted from the code of the method snippet of Table 1 

Our preprocessing pipeline minimizes the loss of information by performing 
data augmentation on docstrings and code. In docstrings where the information 
is insufficient, the pipeline replaces them with separated camelCase function 
names (e.g. ‘camelCase’ becomes ‘camel case’) that are representative of the 
code. The pipeline also encodes most code symbols to words instead of removing 
them and, thus, reinforces code semantics such as control and data flow. 
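
The symbol encoding of the code pipeline can be sketched as follows. This is a simplified illustration under stated assumptions: `encode_code` is a hypothetical name, the tokenizer is a single regular expression of our own rather than the expressions of Table 2, and the removal of string literals and comments is omitted. Symbols that are not listed in Table 3 are simply dropped, which mirrors the effect of the remove_special step.

```python
import re

# Symbol-to-token mapping taken from Table 3.
SYMBOL_TOKENS = {
    "(": "openingparen", ")": "closingparen",
    "[": "openingbracket", "]": "closingbracket",
    "{": "openingbrace", "}": "closingbrace",
    "+": "addoperator", "-": "subtractoperator",
    "*": "multiplyoperator", "/": "divideoperator",
    "^": "poweroperator", "%": "modulooperator",
    "=": "assignoperator", "==": "equaloperator",
    "!=": "notequaloperator", ">": "greateroperator",
    "<": "lessoperator", ">=": "greaterequaloperator",
    "<=": "lessequaloperator", "++": "incrementoperator",
    "--": "decrementoperator", "!": "notoperator",
    "@": "atsign", ";": "semicolon",
}

# Longest symbols first, so that "==" is not read as two "=".
_SYMBOLS = "|".join(
    re.escape(s) for s in sorted(SYMBOL_TOKENS, key=len, reverse=True)
)
_TOKEN = re.compile(r"[A-Za-z0-9]+|" + _SYMBOLS)
# Assumption: simplified camelCase boundary, as in the docstring pipeline.
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])")


def encode_code(code, max_tokens=100):
    tokens = []
    for piece in _TOKEN.findall(code):
        if piece in SYMBOL_TOKENS:
            # Encode the programming symbol as a unique word.
            tokens.append(SYMBOL_TOKENS[piece])
        else:
            # Split identifiers such as 'fieldName' into 'field', 'Name'.
            tokens.extend(_CAMEL_BOUNDARY.sub(" ", piece).split())
    # Keep the first 100 tokens and lowercase them.
    return [token.lower() for token in tokens[:max_tokens]]
```

Applied to the statement `JsonNode node = tree.get(fieldName);`, the sketch yields the same token run that appears for this statement in Figure 2.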


3.2 Neural Network 


In this subsection we present the main module of our system, a neural network 
that employs transformers to map natural language queries to source code. 


Network Architecture The main architecture of CodeTransformer is based on 
Matching Networks [23], a neural network architecture designed to solve One- 
Shot Learning problems. Our system, however, follows a slightly different ap- 
proach, as it uses an improved embedding similarity metric and does not require 
an external memory to function. As we discuss in the following subsections, our 
architecture utilizes self-attention encoders and a hybrid geometric similarity 
metric. In contrast to the original approach, ours does not use a softmax func- 
tion on its output, as the similarity metric we selected does not natively support 
it. In Figure 3 we present the architecture of the Neural Network module. 


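Fig. 3. The main architecture of the Neural Network module. Labels recovered from 
the diagram: Query and Code inputs; tensors of shape (BS, SL, D) before max pooling; 
max pooling output of shape (BS, D); comparison matrix of shape (BS, BS), whose 
diagonal holds the positive pairs and whose remaining cells hold the negative pairs 
(BS: batch size, SL: sequence length, D: dimensions). 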


Transformers To maximize the semantic abilities of our system, we employed 
the state-of-the-art Transformers architecture on both of its encoders [22]. A 
Transformer consists of two modules, an encoder and a decoder, with minimal 
architectural differences. Considering the fact that a Matching Network performs 
feature extraction and not direct translation of language data, our implementa- 
tion solely requires encoders for its function. The architecture of the Transformer 
encoder is presented in Figure 4. 

The Transformer encoder comprises an embedding layer, a Positional Encod- 
ing layer and encoder layers, i.e. consecutive blocks of Multi-Head Attention and 
Feed-Forward Network layers. In our implementation, we opted for three stacked 
encoder layers, as they provide sufficient depth for achieving high efficiency. 


Fig. 4. The architecture of a Transformer encoder 


Before inserting a token sequence to an encoder, we create a vocabulary 
that includes the most frequently occurring words and then encode them to 
integers. We build two vocabularies, each consisting of 10,000 unique words. 
After encoding, we pad each entry with zeros to form tensors of equal dimensions. 
To enhance the generalization capabilities of our system, we reshuffle the dataset 
at the start of every training iteration and divide it in batches of 128. 
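
A minimal sketch of the vocabulary and padding steps is given below. The helper names are hypothetical, and the reserved identifier 1 for out-of-vocabulary words is our assumption; the text only states that the most frequent words are encoded as integers and that entries are padded with zeros.

```python
from collections import Counter

PAD_ID = 0  # padding value, as stated in the text
UNK_ID = 1  # assumption: identifier for words outside the vocabulary


def build_vocab(sequences, max_size=10_000):
    """Map the max_size most frequent tokens to integer identifiers."""
    counts = Counter(token for seq in sequences for token in seq)
    # Identifiers 0 and 1 are reserved for padding and unknown words.
    return {tok: i + 2 for i, (tok, _) in enumerate(counts.most_common(max_size))}


def encode_and_pad(sequences, vocab):
    """Encode tokens as integers and pad every entry to the longest one."""
    encoded = [[vocab.get(tok, UNK_ID) for tok in seq] for seq in sequences]
    width = max(len(seq) for seq in encoded)
    return [seq + [PAD_ID] * (width - len(seq)) for seq in encoded]
```

With two vocabularies, as in the text, `build_vocab` would be called once on the docstring token lists and once on the code token lists.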

When a token sequence is received as input, the encoder embeds the tokens 
in a high-dimensional vector space. In other words, the encoder generates word 
embeddings, i.e. vector representations aiming to extract token information. The 
encoder generates word embeddings of 128 dimensions using an embedding layer. 
The natural language encoder and the source code encoder have identical pa- 
rameter values, but each encoder has its own distinct weights and vocabulary. 
To generate sequence embeddings we use max pooling, as it extracts the most 
essential features of the embeddings output by the stacked encoder layers. 
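
As a small illustration of the pooling step, the sketch below reduces the word embeddings of one sequence, of shape (SL, D), to a single sequence embedding of shape (D) by taking the maximum over the sequence axis. It uses plain Python lists instead of tensors, and `max_pool` is a hypothetical name; masking of padded positions is not shown.

```python
def max_pool(token_embeddings):
    """Element-wise maximum over the token axis: (SL, D) -> (D)."""
    return [max(dim_values) for dim_values in zip(*token_embeddings)]
```

Applying the function to every sequence of a batch turns a (BS, SL, D) tensor into the (BS, D) matrix of sequence embeddings.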


Similarity Metric The similarity between natural language and code sequence 
embeddings is usually quantified using the Euclidean distance or the cosine sim- 
ilarity. However, the computation of the Euclidean distance between two vectors 
does not contain any information about the angle between the two vectors. On 
the other hand, cosine similarity does not consider the magnitude of the vectors. 
Our system utilizes a hybrid similarity metric, the Triangle’s Area Similarity 
- Sector’s Area Similarity [11], also known as TS-SS, which improves upon the 
aforementioned metrics by incorporating the Euclidean distance, the magnitude 
difference and the angle between two vectors to compute their similarity. The 
Triangle’s Area Similarity (TS) comprises the Euclidean distance, the magnitude 
of each vector and the angle between them, while the Sector’s Area Similarity 
(SS) provides the magnitude difference. The TS of two vectors A and B is: 


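$$ TS(A, B) = \frac{|A| \cdot |B| \cdot \sin(\theta')}{2} \tag{1} $$ 


where $\theta' = \theta + 10^{\circ}$ and $\theta = \cos^{-1}\left(\frac{A \cdot B}{|A| \cdot |B|}\right)$ is the angle between the two vectors. 
We use $\theta'$ instead of $\theta$ so that the computation is valid in the case of overlapping 
vectors (when $\theta = 0$). The SS of two vectors A and B is defined as: 


$$ SS(A, B) = \pi \cdot (ED(A, B) + MD(A, B))^2 \cdot \frac{\theta'}{360} \tag{2} $$ 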


where $\theta'$ is defined as above, while ED(A, B) and MD(A, B) correspond to 
the Euclidean distance and the magnitude difference between the two vectors, 
respectively. Given the dimension of the vectors N, the magnitude difference is: 


$$ MD(A, B) = \left| \sqrt{\sum_{n=1}^{N} A_n^2} - \sqrt{\sum_{n=1}^{N} B_n^2} \right| \tag{3} $$ 


Merging TS and SS via addition is not possible, as they are on different scales. 
According to Heidarian and Dinneen [11], their multiplication establishes a new 
scale that sufficiently represents similarity. Consequently, TS-SS is computed as: 


$$ TS\text{-}SS(A, B) = \frac{|A| \cdot |B| \cdot \sin(\theta') \cdot \theta' \cdot \pi \cdot (ED(A, B) + MD(A, B))^2}{720} \tag{4} $$ 

TS-SS values range from 0 to infinity, with 0 indicating that two vectors are 
identical. Accordingly, the TS-SS value of two dissimilar vectors is larger than 
zero and has no upper bound. In our implementation, we use the reciprocal of 
TS-SS, so that higher values indicate more similar vectors, which suits the custom 
loss function we use during training. The final similarity of the two vectors is computed as: 


$$ \text{Similarity}(A, B) = \frac{1}{TS\text{-}SS(A, B)} \tag{5} $$ 
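
Equations (1) to (5) can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' code: the function names are hypothetical, angles are handled in degrees as in the definition of the shifted angle, and the small constant `eps` that keeps the reciprocal finite for identical vectors is our assumption, since the text does not state how that case is handled.

```python
import math


def ts_ss(a, b):
    """TS-SS of two non-zero vectors given as lists of floats (Eq. 4)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    cosine = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    cosine = max(-1.0, min(1.0, cosine))  # guard against rounding errors
    # Shifted angle in degrees: angle between the vectors plus 10 degrees.
    theta = math.degrees(math.acos(cosine)) + 10.0
    ed = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))  # Euclidean distance
    md = abs(norm_a - norm_b)                                # magnitude difference (Eq. 3)
    ts = norm_a * norm_b * math.sin(math.radians(theta)) / 2.0  # Eq. 1
    ss = math.pi * (ed + md) ** 2 * theta / 360.0               # Eq. 2
    return ts * ss


def similarity(a, b, eps=1e-12):
    """Reciprocal TS-SS (Eq. 5); eps is an assumption to avoid division by zero."""
    return 1.0 / (ts_ss(a, b) + eps)
```

Identical vectors give a TS-SS of 0, because both the Euclidean distance and the magnitude difference vanish, so their reciprocal similarity is bounded only by `eps`.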


Loss Function The neural network of CodeTransformer outputs a square sim- 
ilarity matrix, where each row represents a natural language embedding and 
each column represents a source code embedding. The diagonal matrix cells cor- 
respond to the positive pairs of natural language and source code and their values 
ought to be high. The rest of the matrix cells correspond to the negative pairs, 
and their values ought to be low. At network initialization, all embeddings con- 
tain random values and are scattered throughout the vector space. As a result, 
in order to bring all similar embeddings closer during training, we need to utilize 
a loss function that is based on the computations of the reciprocal TS-SS. 

A loss function typically used by similar systems (such as CodeSearchNet 
[12] and DeepCS [9]) is a variation of Hinge loss, computed as follows: 


$$ \text{Loss} = \max(0, 1 - \text{positive} + \text{negative}) \tag{6} $$ 


Upon testing this variation of Hinge loss, we observed that it did not result in 
successful integration with the vanilla or reciprocal TS-SS. Even after modifying 
the function’s margin to a value larger than 1, due to TS-SS infinite value range, 
the result was always the same. The embeddings constantly collapsed to a specific 
point, not allowing distinct sequence embeddings for each positive pair. 

This led us to design a custom loss function, based on the squared variation 
of Hinge loss. We name this loss function Squared Margin Loss and define it as: 


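$$ \text{Loss} = \left(\max(0, \text{margin} - \text{positive})\right)^2 + \text{negative}^2 \tag{7} $$ 

Furthermore, the derivatives of our loss function are defined as follows: 


$$ \frac{\partial \text{Loss}}{\partial (\text{positive})} = \begin{cases} -2 \cdot (\text{margin} - \text{positive}), & \text{if positive} < \text{margin} \\ 0, & \text{otherwise} \end{cases} \tag{8} $$ 


$$ \frac{\partial \text{Loss}}{\partial (\text{negative})} = 2 \cdot \text{negative} \tag{9} $$ 
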


The Squared Margin Loss penalizes larger loss values more heavily and smaller 
loss values more lightly. Thus, the function ensures the 
convergence of the network at first epochs and its optimization at later epochs. 
Because the positive term is clipped by the max function, the positive pairs with a 
similarity value above the margin do not take part in the computation of the loss. 
In this case, however, the similarity values of the corresponding negative pairs 
continue to decrease. This allows the similarity values of the diagonal to increase 
further than the margin. Without the use of the max function, the elements that 
have crossed the margin would generate useless losses and positive gradients, 
resulting in the fluctuation of their similarity values around the margin. 
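
A minimal sketch of the Squared Margin Loss over a square similarity matrix is given below. Equation (7) defines the loss for one positive and one negative value; how the terms are aggregated over the matrix is not stated in the text, so summing the diagonal and off-diagonal terms and averaging over the batch size is our assumption, as is the name `squared_margin_loss`.

```python
def squared_margin_loss(sim, margin=5.0):
    """Squared Margin Loss (Eq. 7) for a square similarity matrix.

    sim[i][j] is the similarity of query i and code snippet j; the
    diagonal holds the positive pairs, all other cells the negative pairs.
    """
    n = len(sim)
    # Positive pairs contribute only while they are below the margin.
    positive_term = sum(max(0.0, margin - sim[i][i]) ** 2 for i in range(n))
    # Negative pairs are always pushed towards zero similarity.
    negative_term = sum(
        sim[i][j] ** 2 for i in range(n) for j in range(n) if i != j
    )
    # Assumption: average over the batch size.
    return (positive_term + negative_term) / n
```

A matrix whose diagonal values are at or above the margin and whose other cells are zero yields a loss of 0, which matches the behavior described above for positive pairs that have crossed the margin.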


Optimizer We train our neural network using the Adaptive Moment Estima- 
tion (Adam) optimizer [14], which computes adaptive learning rates for each 
parameter. Adam stores the exponentially decaying average of past gradients 
and the exponentially decaying average of past squared gradients. Using Adam 
ensures that the network converges fast through momentum estimation. The 
convergence also depends on the learning rate; a poor choice of its value can 
slow down the training process, or even derail the network’s weights. To find the 
ideal learning rate, we examined a range of values generated by the equation: 


$$ \text{LearningRate} = 1.1^{\,\text{step}/100} \cdot 10^{-10} \tag{10} $$ 


This function generates values starting from $10^{-10}$ up to a practically infinite 
value. The learning rate is multiplied by 1.1 once every 100 training steps. 
After plotting the accuracy and loss per training step, we noticed a point 
with a steep increase in accuracy and a steep decrease in loss as well as a point 
with a steep decrease in accuracy and a steep increase in loss. Next, we isolated 
the values between these steps and tested those closer to the lower end, where 
the increase in accuracy and decrease in loss occur. Through trial and error, we 
selected a learning rate value of $3.2 \cdot 10^{-4}$. We set the margin of our network to 
5, the number of attention heads to 8, and the feed-forward dimension ($d_{ff}$) to 512, 
and trained the network for 40 epochs, as these were found to be sufficient for effective results. 
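
The learning rate sweep of equation (10) can be sketched as below. The function name and parameter names are hypothetical, and the use of integer division reflects our reading of the statement that the rate is multiplied by 1.1 once every 100 training steps.

```python
def sweep_learning_rate(step, base=1e-10, factor=1.1, every=100):
    """Learning rate used at a given training step of the sweep (Eq. 10)."""
    # Assumption: the exponent grows by one after every `every` steps.
    return base * factor ** (step // every)
```

Plotting accuracy and loss against the values produced by this function, one value per training step, gives the curve from which the usable range of learning rates is read off.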


3.3 Index Builder 


Due to the complexity of our neural network and the number of its parameters, 
fast response times cannot be guaranteed. To significantly reduce the processing 
time of our system, we employed Annoy [2], a tool using a Nearest Neighbor 
Search algorithm. Using Annoy in the Index Builder module allows us to generate 
a vector space that contains all the source code embedding vectors of the corpus. 

Annoy assigns an index to each code embedding and then sorts them based 
on their values by building up a forest of trees. The dimension of the 
vector space is set according to the dimension of the output embedding, which is 
128. We calculated the Euclidean distance between the vectors and built 10 trees. 
Regarding the search process in the vector space, we select the first 100 nearest 
vectors out of the 10,000 nearest forest nodes. Thus, instead of calculating the 
similarity between a query and the whole corpus using the neural network, Annoy 
compares the query vector with the nearest 10,000 code vectors. In addition, 
Annoy’s search time does not seem to be hindered by the embedding dimension. 
The search process of a query is executed in three stages. Firstly, the query of 
the user is preprocessed, so that non-alphanumeric symbols are removed, camel- 
Case tokens are separated and uppercase characters are lowercased. Secondly, 
every query token is encoded as an integer to be passed as input to the neural 
network. The neural network, in inference mode, generates the sequence embed- 
ding of the query to be inserted to the vector space of the Index Builder. Finally, 
Annoy extracts the indices of the 10 code vectors nearest to the query, and the 
corresponding code snippets and GitHub URLs are presented to the user. 
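
The final search stage can be illustrated with an exact, brute-force nearest neighbor search. Note that this is a stand-in and not Annoy: Annoy performs an approximate search over a forest of trees, whereas the hypothetical function below compares the query with every code vector, which is only practical for small collections.

```python
import heapq
import math


def nearest_code_vectors(query_vec, code_vecs, k=10):
    """Indices of the k code vectors closest to the query (Euclidean distance)."""
    return heapq.nsmallest(
        k,
        range(len(code_vecs)),
        key=lambda i: math.dist(query_vec, code_vecs[i]),
    )
```

The returned indices play the role of the indices extracted from the vector space, which are then mapped to the corresponding code snippets and GitHub URLs.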


4 Evaluation 


We evaluate our system using two different datasets, the Java corpus of 
CodeSearchNet [12], and a set of popular Java questions from Stack Overflow². 

The performance of our system is assessed using the Precision at K (P@K), 
the Mean Reciprocal Rank (MRR) [7] and the Normalized Discounted Cumulative 
Gain (NDCG) [13]. P@K indicates how many out of the first K results are 
relevant to the query. MRR further incorporates the order of the results, computed 
as the mean of the reciprocal rank of each query (the reciprocal rank of 
the i-th query is $1/rank_i$, where $rank_i$ is the rank position of the first relevant 
document). The NDCG is the normalized DCG, computed for N results as: 


$$ DCG = \sum_{i=1}^{N} \frac{rel_i}{\log_2(i + 1)} \tag{11} $$ 


where $rel_i$ is the graded relevance of the result at position i. Thus, NDCG is 
computed by dividing the result of equation (11) by the ideal DCG, i.e. the one 
produced if all the results in the list were sorted in the correct order. 
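
The three metrics can be sketched as follows. The function names are hypothetical, and the DCG uses the common formulation with a base-2 logarithm of the position plus one, which is an assumption about the exact form used. Queries without any relevant result are given a reciprocal rank of 0, which is also an assumption.

```python
import math


def precision_at_k(relevant_flags, k):
    """Fraction of the first k results that are relevant (flags are 0 or 1)."""
    return sum(relevant_flags[:k]) / k


def mean_reciprocal_rank(first_relevant_ranks):
    """Mean of 1/rank over all queries; None marks a query with no relevant result."""
    reciprocal = [0.0 if r is None else 1.0 / r for r in first_relevant_ranks]
    return sum(reciprocal) / len(reciprocal)


def dcg(relevances):
    """Discounted cumulative gain of graded relevances in ranked order."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG divided by the DCG of the ideally sorted list."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

For example, three queries whose first relevant results appear at ranks 1 and 2, with no relevant result for the third, give a mean reciprocal rank of 0.5.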


4.1 Evaluation using CodeSearchNet Queries 


CodeTransformer employs the CodeSearchNet corpus [12] for training and infer- 
ence, allowing its direct comparison with the implementations of CodeSearchNet. 
CodeSearchNet comprises four different encoder architectures. One of them is the 
Self-Attention (SelfAtt) architecture, which was examined in the previous section. 
The Neural Bag of Words (NBoW) architecture measures word occurrence 
within a document, therefore it performs well on keyword-based search operations. 
The 1D Convolutional Neural Network (1D-CNN) architecture learns to 
recognize complex, non-linear patterns. In contrast to NBoW and 1D-CNN, the 
Bidirectional RNN (biRNN) architecture further models the word order. 


2 The code and details used to reproduce our findings can be found at the repository: 
https://github.com/AuthEceSoftEng/CodeTransformer 

The four implementations are compared to CodeTransformer on the test set 
of the CodeSearchNet corpus, which includes 15000 docstring and code snippet 
pairs, for the computation of MRR. Additionally, the four implementations are 
compared to CodeTransformer using 99 annotated queries provided by Code- 
SearchNet for computing NDCG. The results are shown in Table 4. Note that, 
although our system is not directly compared with DeepCS [9] as the systems use 
different data, we compare it with the biRNN implementation of CodeSearchNet, 
which has a neural architecture similar to that of DeepCS. 


Table 4. Evaluation results of CodeTransformer and CodeSearchNet 


System MRR NDCG 


CodeSearchNet-NBoW 0.5140 0.1207 
CodeSearchNet-1D-CNN 0.5270 0.1282 
CodeSearchNet-biRNN 0.2865 0.0623 
CodeSearchNet-SelfAtt 0.5866 0.1003 
CodeTransformer 0.6263 0.1028 


Concerning MRR, our system outperforms all four CodeSearchNet implementations,
indicating that the different strategies followed for our data pipeline are effective.
Another factor that may contribute to this result is our preprocessing methodology,
as the replacement of insufficient docstrings with function names may have
led to increased MRR values. As a side note, these results were
also clear during the validation phase of the algorithms (e.g., the MRR of
CodeTransformer for the validation set was the highest at 0.62604, while the second
highest was that of CodeSearchNet-SelfAtt at 0.5513).
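
For readers less familiar with the metric, the following minimal sketch illustrates how MRR is computed from ranked results. The function names and the toy data are hypothetical, and the sketch does not reproduce the exact evaluation protocol of CodeSearchNet; it only shows the general definition of the measure.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1/rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(results, ground_truth):
    """Average the reciprocal rank over all queries.

    results: dict mapping a query to its ranked list of snippet ids.
    ground_truth: dict mapping a query to the set of relevant snippet ids.
    """
    scores = [reciprocal_rank(results[q], ground_truth[q]) for q in results]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy data: three queries, one relevant snippet each.
results = {
    "q1": ["a", "b", "c"],  # relevant snippet at rank 1
    "q2": ["d", "e", "f"],  # relevant snippet at rank 2
    "q3": ["g", "h", "i"],  # relevant snippet not retrieved
}
ground_truth = {"q1": {"a"}, "q2": {"e"}, "q3": {"z"}}

mrr = mean_reciprocal_rank(results, ground_truth)
print(round(mrr, 4))  # 0.5
```

A query whose relevant snippet is ranked first contributes 1, one ranked second contributes 0.5, and a query with no relevant result contributes 0.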

Concerning NDCG, our system performs slightly better than the corresponding
Self-Attention implementation of CodeSearchNet, while the NBoW
and 1D-CNN implementations perform better than CodeTransformer, possibly
because they use docstrings as natural language. However, we note that only a
small amount of data was annotated for the computation of NDCG (i.e. only
823 out of 1.5 million Java code snippets). In addition, as the authors of
CodeSearchNet note [12], the annotated data were selected using the top 10 results
per query, generated by an ensemble of the CodeSearchNet neural models and
ElasticSearch; the annotated snippets are therefore the ones that these systems are
more likely to produce. Hence, it is possible that correct results returned by our
system are not annotated and are thus ignored when computing NDCG.
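
The effect of missing annotations on NDCG can be illustrated with a small sketch. It assumes a linear gain and a logarithmic discount, which is one common formulation and not necessarily the exact variant used by CodeSearchNet; the function names and relevance grades are hypothetical.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of graded relevances given in rank order."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(ranked_ids, annotations):
    """NDCG of one ranking; snippets without an annotation count as relevance 0."""
    gains = [annotations.get(item, 0) for item in ranked_ids]
    ideal = sorted(annotations.values(), reverse=True)[:len(ranked_ids)]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical annotations for one query: only snippets "a" and "b" were rated.
annotations = {"a": 3, "b": 1}

# Ranking that starts with the annotated snippets in the ideal order.
score_annotated_first = ndcg(["a", "b", "u"], annotations)

# Ranking that starts with "u", a snippet that may be correct but was never
# annotated; it is treated as irrelevant and lowers the score.
score_unannotated_first = ndcg(["u", "a", "b"], annotations)

print(score_annotated_first > score_unannotated_first)  # True
```

Under these assumptions, a system that returns a correct but unannotated snippet at the top is scored as if it had returned an irrelevant one.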

Figure 5 depicts the distribution and the individual MRR values for 99 queries
of the test set of CodeSearchNet [12]. As the annotations were not provided, we
annotated the first 10 results returned by our system to compute the MRR. The
majority of MRR values are equal to 1, indicating that our system returns a relevant
result in the first position for more than half of the queries. By examining
the results, we found that our system effectively models the semantic information
of the text and the code snippets. Indicatively, for Q64, CodeTransformer
outputs a function that sorts an array using another array’s order, even though
almost none of the exact words of the query are present in the code (except
for the word “sort”). Semantically similar terms are also effectively interpreted.
For example, for Q16, which requests exporting data to an excel file, our system returns an
exportXls method, thus modeling the semantic similarity between the terms “excel”
and “xls”. Similarly, given Q91, which requests data extraction from a text file,
CodeTransformer returns a method using the term “read” instead of “extract”.
Q01: convert int to string 1.000 Q51: how to randomly pick a number 1.000
Q02: priority queue 0.500 Q52: normal distribution 1.000 
Q03: string to date 1.000 Q53: nelder mead optimize 0.000 
Q04: sort string list 1.000 Q54: hash set for counting distinct elements 0.000 
Q05: save list to file 0.500 Q55: how to get database table name 1.000 
Q06: postgresql connection 1.000 Q56: deserialize json 0.500
Q07: confusion matrix 1.000 Q57: find int in string 1.000 
Q08: set working directory 1.000 Q58: get current process id 1.000 
Q09: group by count 0.000 Q59: regex case insensitive 0.160 
Q10: binomial distribution 0.200 Q60: custom http error response 1.000 
Q11: aes encryption 1.000 Q61: how to determine a string is a valid word 0.200 
Q12: linear regression 1.000 Q62: html entities replace 0.330 
Q13: socket recv timeout 0.000 Q63: set file attrib hidden 0.330 
Q14: write csv 1.000 Q64: sorting arrays based on another arrays order? 1.000 
Q15: convert decimal to hex 1.000 Q65: string similarity levenshtein 1.000 
Q16: export to excel 1.000 Q66: how to get html of website 0.000 
Q17: scatter plot 1.000 Q67: buffered file reader read text 0.500 
Q18: convert json to csv 0.160 Q68: encrypt aes ctr mode 0.000 
Q19: pretty print json 1.000 Q69: matrix multiply 0.500 
Q20: replace in file 0.500 Q70: print model summary 0.000 
Q21: k means clustering 1.000 Q71: unique elements 0.500 
Q22: connect to sql 1.000 Q72: extract data from html content 0.000 
Q23: html encode string 1.000 Q73: heatmap from 3d coordinates 0.000 
Q24: finding time elapsed using a timer 0.125 Q74: get all parents of xml node 0.000 
Q25: parse binary file to custom class 0.500 Q75: how to extract zip file recursively 1.000 
Q26: get current ip address 1.000 Q76: underline text in label widget 0.000 
Q27: convert int to bool 0.250 Q77: unzipping large files 0.200 
Q28: read text file line by line 1.000 Q78: copying a file to a path 0.500 
Q29: get executable path 1.000 Q79: get the description of a http status code 1.000
Q30: httpclient post json 0.500 Q80: randomly extract x items from a list 1.000 
Q31: get inner html 0.500 Q81: convert a date string into yyyymmdd 0.330 
Q32: convert string to number 0.000 Q82: convert a utc time to epoch 1.000 
Q33: format date 1.000 Q83: all permutations of a list 1.000 
Q34: readonly array 0.000 Q84: extract latitude and longitude from given input 1.000 
Q35: filter array 1.000 Q85: how to check if a checkbox is checked 0.000 
Q36: map to json 0.500 Q86: converting uint8 array to image 0.125 
Q37: parse json file 0.330 Q87: memoize to disk - persistent memoization 0.000 
Q38: get current observable value 0.140 Q88: parse command line argument 1.000 
Q39: get name of enumerated value 1.000 Q89: how to read contents of a .gz compressed file? 0.000 
Q40: encode url 1.000 Q90: sending binary data over a serial connection 1.000 
Q41: create cookie 1.000 Q91: extracting data from a text file 1.000 
Q42: how to empty array 1.000 Q92: positions of substrings in string 0.000 
Q43: how to get current date 1.000 Q93: reading element from html - <td> 0.000 
Q44: how to make the checkbox checked 1.000 Q94: deducting the median from each column 1.000 
Q45: initializing array 1.000 Q95: concatenate several file remove header lines 0.000 
Q46: how to reverse a string 1.000 Q96: parse query string in url 1.000 
Q47: read properties file 1.000 Q97: fuzzy match ranking 0.000 
Q48: copy to clipboard 1.000 Q98: output to html file 0.000 
Q49: convert html to pdf 0.000 Q99: how to read .csv file in an efficient way? 0.200
Q50: json to xml conversion


Fig. 5. MRR values of CodeTransformer for the 99 queries of the CodeSearchNet dataset

Concerning queries for which our system did not perform as effectively, some
of them are relevant to other programming languages and/or are not covered by
the corpus. Note that these 99 queries are drawn from 6 languages and thus not
all of them are relevant to Java. An example of an unanswered query is Q34, as
read-only arrays do not exist in Java and, therefore, a relevant code snippet is not
included in the corpus. After manually inspecting the corpus, we concluded that
queries Q53, Q68, Q70, Q73, Q76 and Q97 are focused on other languages rather
than Java. This is also the case for HTML parsing queries, such as queries
Q49, Q66, Q72, Q93 and Q98, for which we could find a few Java methods by
manual inspection, although they are mainly targeted at other languages. In any
case, considering the results of Table 4 and Figure 5, CodeTransformer seems to
provide a relevant answer in the first two positions more often than not.

Finally, as a proof of concept, Table 5 depicts the declarations of the methods 
returned by our system for query Q91, which refers to “extracting data from a 
text file”. It is clear that the methods respond effectively to the query. 


Table 5. Declarations of the methods returned by CodeTransformer for query Q91
“extracting data from a text file”

#   Method Declaration

1   public static String readTextFile(Context context, int resId)
2   public static String readTextFile(Context context, String asset)
3   public DataSource<String> readTextFile(String filePath)
4   public static String readTextFile(String fileName)
5   public static String readTextFile(File file)
6   public DataSource<String> readTextFile(String filePath, String charsetName)
7   public static String readTextFile(Context context, int resourceId)
8   public static String readTextFile(File file) throws IOException
9   private ProjectFile readTextFile(InputStream inputStream) throws MPXJException
10  public DataStreamSource<String> readTextFile(String filePath, String charsetName)


4.2 Evaluation using Stack Overflow Questions 


To further evaluate CodeTransformer, we reviewed its performance on real user
queries. Although our model is trained on docstrings instead of real queries, we consider
this experiment adequate for assessing its effectiveness as a proof of concept.
We manually selected the 40 highest-rated Stack Overflow posts at the
time of research in which the posters search for Java code snippets. After querying
our system using their titles, we obtained 10 results for each query, sorted
by their similarity to the query. Next, we manually annotated the relevance of
each result to the query, making sure that the result is a valid answer. To limit
threats to validity, the annotations were performed without knowledge of
the order of the results. Table 6 depicts the questions as well as the rank of the
first relevant result and the precision at the first 10 results for each question.
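
The two measures reported per question can be sketched as follows. The function names and the annotation vector are hypothetical and serve only to illustrate the definitions; the sketch assumes that exactly 10 results are annotated per question.

```python
def first_relevant_rank(relevance_flags):
    """1-based rank of the first relevant result, or None if there is none."""
    for rank, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            return rank
    return None


def precision_at_k(relevance_flags, k=10):
    """Fraction of the top-k results that were annotated as relevant."""
    return sum(relevance_flags[:k]) / k


# Hypothetical annotations for the 10 results of one question, in rank order.
flags = [False, True, True, False, True, True, True, False, True, True]

print(first_relevant_rank(flags))  # 2
print(precision_at_k(flags))       # 0.7
```

A question with no relevant result among the 10 yields no rank and a precision of zero, which corresponds to the unanswered rows of the table.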
Table 6. Evaluation results of CodeTransformer on the set of the 40 most popular
Stack Overflow Java questions

# Questions Rank P@10

S01 How do I read / convert an InputStream into a String in Java? 1 1.0
S02 Create ArrayList from array 2 0.7
S03 How do I generate random integers within a specific range in Java? 2 0.7
S04 Iterate through a HashMap [duplicate] 2 0.2
S05 How do I efficiently iterate over each entry in a Java Map? 2 0.1
S06 How do I convert a String to an int in Java? 1 0.2
S07 Initialization of an ArrayList in one line - -
S08 How do I determine whether an array contains a value in Java? 3 0.3
S09 How do I call one constructor from another in Java? - -
S10 How do I declare and initialize an array in Java? - -
S11 How to get an enum value from a string value in Java? 1 1.0
S12 What’s the simplest way to print a Java array? 1 1.0
S13 How to generate a random alpha-numeric string? 1 0.8
S14 How to split a string in Java 1 1.0
S15 Sort a Map<Key, Value> by values 7 0.1
S16 How do I create a Java string from the contents of a file? 7 0.1
S17 How can I convert a stack trace to a string? 1 0.8
S18 Fastest way to determine if an integer’s square root is integer - -
S19 How do I create a file and write to it in Java? 3 0.3
S20 How can I concatenate two arrays in Java? 1 0.7
S21 How to round a number to n decimal places in Java 1 0.8
S22 Convert ArrayList<String> to String[] array 1 0.7
S23 Sort ArrayList of custom Objects by property 1 0.5
S24 How can I initialise a static Map? - -
S25 How to directly initialize a HashMap (in a literal way)?
S26 How to create a generic array in Java?
S27 How to parse JSON in Java
S28 Converting array to list in Java
S29 How to get the current working directory in Java?
S30 Converting ‘ArrayList<String>’ to ‘String[]’ in Java
S31 How can I pad an integer with zeros on the left?
S32 How can I get the current stack trace in Java?
S33 Java 8 List<V> into Map<K, V>
S34 Reading a plain text file in Java 1 1.0
S35 How to check if a String is numeric in Java 3 0.7
S36 Java string to date conversion 1 0.9
S37 A ‘for’ loop to iterate over an enum in Java - -
S38 How do I convert a String to an InputStream in Java? - -
S39 Convert InputStream to byte array in Java 2 0.7
S40 How can I read a large text file line by line using Java? 1 0.7


Precision at the first 10 results is relatively high for most queries. Moreover,
we may note that CodeTransformer effectively disambiguates among queries with
similar context. Consider, e.g., queries S17 and S32, which are both relevant to
stack traces; although these queries are similar, the system was able to comprehend
the semantics of each query and return several highly ranked relevant
results. Even for queries with low precision in their results, CodeTransformer
in most cases placed the first relevant result in the first or the second position. Thus, even
though for some queries there are not many relevant results, the users typically
receive at least one correct answer. An example would be query S06, for
which the system returned only two relevant results, but one of them is ranked
in the first place. It is also notable, in the same query, that CodeTransformer
distinguishes between converting “string to integer” and “integer to string”.

Eight out of the 40 questions were not answered at all, mostly
because a matching function does not exist in the corpus. For example, queries
S07, S09, S10, S24, and S37 do not require a whole method for their implementation
and, thus, the corpus does not include relevant code snippets. Other
queries may be too complex, such as query S18, for which our system returns
some relevant code snippets, although these results do not meet the condition of
being the fastest way to examine if an integer’s square root is an integer.

In Table 7 we provide three example Stack Overflow queries and the corre- 
sponding relevant answers. For the first two queries, CodeTransformer has placed 
the answers at the first position, while for the third query the answer was placed 
at the second position. As shown by these examples, CodeTransformer indeed 
retrieves and recommends useful snippets in a question-answering scenario. 


5 Conclusion 


Although there are several approaches for code snippet retrieval, most of them 
do not consider semantics of natural language and code, ignoring essential in- 
formation regarding the data. Furthermore, several of them recommend API 
calls or sequences instead of reusable code snippets, requiring more effort from 
the developer. Deep learning systems are usually more effective, however most 
do not employ advanced neural transformer architectures and are limited by 
the fact that they are not trained on annotated datasets. Our system,
CodeTransformer, overcomes these limitations by employing a state-of-the-art neural
network architecture. The advanced attention mechanisms of this architecture,
including a specialized similarity metric and a custom loss function, along with the
preprocessing pipeline specifically designed to augment natural language and
code semantics, allow the system to generate powerful data representations.
Upon evaluating CodeTransformer against the implementations of Code- 
SearchNet, we found that our system is more effective, especially when the devel- 
oper would prefer to receive the method most relevant to the query rather than 
a list of related methods. We further assessed CodeTransformer on a dataset 
of actual questions from Stack Overflow, with the results indicating that it is 
capable of retrieving useful code, even for complex natural language queries. 
For future work, we consider training our network using real-life natural
language data, such as Stack Overflow questions, instead of code documentation.
In addition, we could train our network using other (less curated) datasets
and explore different preprocessing techniques, incorporating the semantics of
programming symbols and the information provided by method names to the
natural language data. Finally, we could explore whether our system can generate
docstrings by providing code snippets as input to the code encoder and
comparing their sequence embeddings to docstring sequence embeddings.


Table 7. Example Stack Overflow queries and the answers of CodeTransformer

Query S11: How to get an enum value from a string value in Java?

Result:
    public static T getEnumFromString(Class c, String string) {
        if (c != null && string != null) {
            return Enum.valueOf(c, string.trim().toUpperCase());
        }
        return null;
    }

Query S20: How can I concatenate two arrays in Java?

Result:
    public static String[] concat(String[] array1, String[] array2) {
        int length1 = array1.length;
        int length2 = array2.length;
        int length = length1 + length2;
        String[] dest = new String[length];
        System.arraycopy(array1, 0, dest, 0, length1);
        System.arraycopy(array2, 0, dest, length1, length2);
        return dest;
    }

Query S36: Java string to date conversion

Result:
    public static Date serviceStringToDate(String s) {
        if (s == null)
            return null;
        try {
            return new SimpleDateFormat(_serviceDateFormat).parse(s);
        }
        catch (Exception e) {
            return null;
        }
    }
Abstract. Automatic Speech Recognition (ASR) systems have become
ubiquitous. They can be found in a variety of form factors and are increasingly
important in our daily lives. As such, ensuring that these systems
are equitable to different subgroups of the population is crucial. In
this paper, we introduce AEQUEVOX, an automated testing framework
for evaluating the fairness of ASR systems. AEQUEVOX simulates different
environments to assess the effectiveness of ASR systems for different
populations. In addition, we investigate whether the chosen simulations
are comprehensible to humans. We further propose a fault localization
technique capable of identifying words that are not robust to these varying
environments. Both components of AEQUEVOX are able to operate
in the absence of ground truth data.

We evaluate AEQUEVOX on speech from four different datasets using
three different commercial ASRs. Our experiments reveal that non-native
English, female and Nigerian English speakers generate 109%, 528.5%
and 156.9% more errors, on average, than native English, male and UK
Midlands speakers, respectively. Our user study also reveals that 82.9% of
the simulations (employed through speech transformations) had a
comprehensibility rating above seven (out of ten), with the lowest rating
being 6.78. This further validates the fairness violations discovered by
AEQUEVOX. Finally, we show that the non-robust words, as predicted
by the fault localization technique embodied in AEQUEVOX, show 223.8%
more errors than the predicted robust words across all ASRs.


1 Introduction 


Automated speech recognition (ASR) systems have made great strides in a variety
of application areas, e.g. smart home devices, robotics and handheld devices,
among others. The wide variety of applications has made ASR systems serve
increasingly diverse groups of people. Consequently, it is crucial that such systems
behave in a non-discriminatory fashion. This is particularly important because
assistive technologies powered by ASR systems are often the primary mode of
interaction for users with certain disabilities [20]. Consequently, it is critical that
an ASR system employed in such systems is effective in diverse environments
and across a wide variety of speakers (e.g. male, female, native English speakers,
non-native English speakers) since they are often deployed in safety-critical
scenarios [18].

Fig. 1: Fairness Testing in AEQUEVOX

In this paper, we are broadly concerned with the fairness properties of ASR
systems. Specifically, we investigate whether speech from one group is more
robustly recognised than speech from another group. For instance, consider the
example shown in Figure 1 for a system ASR. The metric $ASR_{Err}$ captures the error
rate induced by ASR. Consider speech from two groups of speakers, i.e. male and
female. We assume that the ASR has similar error rates for both groups of
speakers, as illustrated in the upper half of Figure 1. We now apply a small,
constant perturbation to the speech provided by the two groups. Such a
perturbation can be, for instance, the addition of small noise, exemplifying the natural
conditions that ASR systems may need to work in (e.g. a noisy environment).
If we observe that $ASR_{Err}$ increases disproportionately for one of the speaker
groups as compared to the other, then we consider such behaviour a violation
of fairness (see the lower half of Figure 1). Intuitively, Figure 1 exemplifies a
violation of Equality of Outcomes [36] in the context of ASR systems, where the
male group is provided with a higher quality of service in a noisy environment
than the female group. Automatically discovering such scenarios of
unfairness by simulating the ASR service in diverse environments is the main
contribution of our AEQUEVOX framework.

AEQUEVOX facilitates fairness testing without having any access to ground
truth transcription data. Although text-to-speech (TTS) can be used for generating
speech, we argue that it is not suitable for accurately identifying the bias
towards speech coming from a certain group. Specifically, speakers may inten- 
tionally use enunciation, intonation, different degrees of loudness or other aspects 
of vocalization to articulate their message. Additionally, speakers unintentionally 
communicate their social characteristics such as their place of origin (through 
their accent), gender, age and education. This is unique to human speech and 
TTS systems cannot faithfully capture all the complexities inherent to human 
speech. Therefore, we believe that fairness testing of ASR systems should involve 
speech data from human speakers. 

We note that human speech (and the ASRs) may be subject to adverse
environments (e.g. noise) and it is critical that the fairness evaluation considers
such adverse environments. To facilitate the testing of ASR systems in adverse
environments, we model the speech signal as a sinusoidal wave and subject it
to eight different metamorphic transformations (e.g. noise, drop, low/high pass
filter) that are highly relevant in real life. Furthermore, in the absence of manually
transcribed speech, we use a differential testing methodology to expose
fairness violations. In particular, AEQUEVOX identifies the bias in ASR systems
via a two-step approach: First, AEQUEVOX registers the increase in error rates
for speech from two groups when subjected to a metamorphic transformation.
Subsequently, if the increase in the error rate of one group exceeds that of the other by a
given threshold, AEQUEVOX classifies this as a violation of fairness. To the best
of our knowledge, no such differential testing methodology has been proposed before.
As a by-product of our AEQUEVOX framework, we highlight words that
contribute to errors by comparing the word counts from the original speech. This
information can be further used to improve the ASR system.

Existing works [17,49] isolate certain sensitive attributes (e.g. gender) and 
use such attributes to test for fairness. Isolating these attributes is difficult in 
speech data, making it challenging to apply existing techniques to evaluate the 
fairness of ASR systems. AEQUEVOX tackles this by formalizing a unique fairness
criterion targeted at ASR systems. Despite some existing efforts in testing
ASR systems [5,13], these are not directly applicable for fairness testing. Ad- 
ditionally, some of these works require manually labelled speech transcription 
data [13]. Finally, differential testing via TTS [5] is not appropriate to deter- 
mine the bias towards certain speakers, as they might use different vocalization 
that might be impossible (and perhaps irrational) to generate via a TTS. In 
contrast, AEQUEVOX works on speech signals directly and defines transforma- 
tions directly on these signals. AEQUEVOX also does not require any access to 
manually labelled speech data for discovering fairness violations. In summary, 
we make the following contributions in the paper: 


1. We formalize a notion of fairness for ASR systems. This formalization draws
parallels between the Equality of Outcomes [36] and the quality of service
provided by ASR systems in varying environments.

2. We present AEQUEVOX, which systematically combines metamorphic
transformations and differential testing to highlight whether speech from a
certain group (e.g. female) is subject to fairness violations by ASR systems.
AEQUEVOX neither requires access to ground truth transcription data nor
does it require access to the ASR model structures.

3. We propose a fault localization method to identify the different words
contributing to fairness errors.

4. We evaluate AEQUEVOX with three different ASR systems, namely Google
Cloud, Microsoft Azure and IBM Watson. We use speech from the Speech
Accent Archive [54], the Ryerson Audio-Visual Database of Emotional Speech
and Song (RAVDESS) [30], Multi speaker Corpora of the English Accents in
the British Isles (Midlands) [11], and a Nigerian English speech dataset [2].
Our evaluation reveals that speech from non-native English speakers and
female speakers exhibits higher fairness violations as compared to native
English speakers and male speakers, respectively.

5. We validate the fault localization of AEQUEVOX by showing that the identified
faulty words generally introduce more errors to ASR systems even when
used within speech generated via TTS systems. The inputs to the TTS system
are randomly generated sentences that conform to a valid grammar.

6. We evaluate (via a user study) the human comprehensibility score of the
transformations employed by AEQUEVOX on the speech signal. The lowest
comprehensibility score was 6.78 and 82.9% of the transformations had a
comprehensibility score of more than seven.


Table 1: Notations used

Notation   Description

$GR_B$     Base group
$GR_k$     $k \in (1, n)$. Various comparison groups
$MT$       Metamorphic transformations
$ASR$      Automatic Speech Recognition system under test
$\tau$     A user-specified threshold beyond which the difference in word error
           rate for the base and comparison groups is considered a violation of
           individual fairness


2 Background 


In this section, we introduce the necessary background information. 


Fairness in ASR Systems: A recent work, FairSpeech [26], uses conversational
speech from black and white speakers to find that the word error rate for
individuals who speak African American Vernacular English (AAVE) is nearly
twice as large as that for white speakers in all cases.


Testing ASR Systems: To date, the major testing focus has been on image
recognition systems and large language models. Few papers have probed ASR
systems. One such work, Deep-Cruiser [13], applies metamorphic transformations
to audio samples to perform coverage-guided testing on ASR systems. Iwama et
al. [23] also perform automated testing on the basic recognition capabilities of
ASR systems to detect functional defects. CrossASR [5] is another recent paper
that applies differential testing to ASR systems.


The Gap in Testing ASR Systems: There is little work on automated methods
to formalise and test fairness in ASR systems. In this work, we present
AEQUEVOX to test the fairness of ASR systems with respect to different population
groups. It accomplishes this with the aid of differential testing of speech samples
that have gone through metamorphic transformations of varying intensity. Our
experimentation suggests that speech from different groups of speakers receives
significantly different quality of service across ASR systems. In the subsequent
sections, we describe the design and evaluation of our AEQUEVOX system.


3 Methodology 


In this section, we discuss AEQUEVOX in detail. In particular, we motivate and 
formalize the notion of fairness in ASR systems. Then, we discuss our methodol- 
ogy to systematically find the violation of fairness in ASR systems. The notations 
used are described in Table 1. 


Motivation: Equality of outcomes [36] describes a state in which all people have
approximately the same material wealth and income, or in which the general
economic conditions of everyone’s lives are alike. For a software system, equality
of outcomes can be thought of as everyone getting the same quality of service
from the software they are using. For a lot of software services, providing the
same quality of service is baked into the system by design. For example, the
results of a search engine only depend on the query. The quality of the result
generally does not depend on any sensitive attributes such as race, age, gender
and nationality. In the context of an ASR, the quality of service does depend
on these sensitive attributes. This inferior quality of service may be especially
detrimental in safety-critical settings such as emergency medicine [18] or air
traffic management [27,21].

In our work, we show that the quality of service provided by ASR systems
is vastly different depending on one’s gender, nationality or accent. Suppose there
are two groups of people using an ASR system, males and females. They have
approximately the same level of service when using this service at their homes. 
However, once they step into a different environment such as a noisy street, the 
quality of service drops notably for the female users, but does not drop noticeably 
for the male users. This is a violation of the principle of equality of outcomes 
(as seen for software systems) and more specifically, group fairness [14]. Such 
a scenario is unfair (violation of group fairness) because some groups enjoy a 
higher quality of service than others. 

In our work, we aim to automate the discovery of this unfairness. We do this
by simulating environments where the behaviour of ASR systems is likely to
vary. The simulated environment is then applied to speech from different groups.
Finally, we measure how different groups are served in different environments.


Formalising Fairness in ASRs: In this section, we formalise the notion of
fairness in the context of automated speech recognition systems (ASRs). The
fairness definition in ASRs is as follows:

$$|ASR_{Err}(GR_i) - ASR_{Err}(GR_j)| < \tau \tag{1}$$

Here, $GR_i$ and $GR_j$ capture speech from distinct groups of people. If the error
rates induced by ASR for group $GR_i$ ($ASR_{Err}(GR_i)$) and for group $GR_j$
($ASR_{Err}(GR_j)$) differ beyond a certain threshold $\tau$, we consider this scenario to
be unfair. Such a notion of unfairness was studied in a recent work [26].

In this work, we want to explore whether different groups are fairly treated
under varying conditions. Intuitively, we subject speech from different groups to
a variety of simulated environments. We then measure the word error rates of the
speech in such simulated environments and check if certain groups fare better
than others. Formally, we capture the notion of fairness targeted by AEQUEVOX
as follows:

$$
\begin{aligned}
D_i &= ASR_{Err}(GR_i) - ASR_{Err}(GR_i + \delta) \\
D_j &= ASR_{Err}(GR_j) - ASR_{Err}(GR_j + \delta) \\
|D_i - D_j| &< \tau
\end{aligned}
\tag{2}
$$


Here we perturb the speech of the two groups ($GR_i$ and $GR_j$) by adding some
$\delta$ to the speech. We compare the degradation in the speech ($D_i$ and $D_j$). If the
degradation faced by one group is far greater than the one faced by the other,
we have a fairness violation. This is because speech from both groups ought to
face similar degradation when subject to similar environments (simulated by the
$\delta$ perturbation) when equality of outcomes [36] holds. More specifically, this is a
group fairness violation because the quality of service (outcome) depends on the
group [14,51].


Algorithm 1 AEQUEVOX Fairness Testing

procedure FAIRNESS_TESTING($GR_B$, $MT$, $GR_1, \cdots, GR_n$, $\tau$, $ASR_1$, $ASR_2$)
    $Error\_Set \leftarrow \emptyset$
    for $T \in MT$ do
        $GR_B^T \leftarrow T(GR_B)$
        ▷ $L$ computes the average word-level Levenshtein distance
        ▷ between the outputs of $ASR_1$ and $ASR_2$
        $d_B \leftarrow L(ASR_1(GR_B), ASR_2(GR_B))$
        $d_B^T \leftarrow L(ASR_1(GR_B^T), ASR_2(GR_B^T))$
        $D_B \leftarrow d_B^T - d_B$
        for $k \in (1, n)$ do
            $GR_k^T \leftarrow T(GR_k)$
            $d_k \leftarrow L(ASR_1(GR_k), ASR_2(GR_k))$
            $d_k^T \leftarrow L(ASR_1(GR_k^T), ASR_2(GR_k^T))$
            $D_k \leftarrow d_k^T - d_k$
            if $D_B - D_k > \tau$ then
                $Error\_Set \leftarrow Error\_Set \cup (GR_B, GR_k, T)$
            end if
        end for
    end for
    return $Error\_Set$
end procedure


Example: To motivate our system, let us sketch out an example. Consider texts
of approximately the same length spoken by two sets of speakers whose native
languages are $L_1$ and $L_2$ respectively. Let us assume that both sets of
speakers read out a text in English. AEQUEVOX uses two ASR systems and obtains
the transcript of this speech. AEQUEVOX then employs differential testing to
find the word-level Levenshtein distance [29] between these two sets of
transcripts. Let us also assume that the average word-level Levenshtein distance
is two and four for $L_1$ and $L_2$ native speakers, respectively.


AEQUEVOX then simulates a noisy environment by adding noise to the speech
and obtains the transcript of this transformed speech. Let us assume now that
the average Levenshtein distance for this transformed speech is 4 and 25 for
$L_1$ and $L_2$ native speakers, respectively. It is clear that the degradation
for the speech of native $L_2$ speakers is much more severe. In this case, the
quality of service that $L_2$ native speakers receive in noisy environments is
worse than that received by $L_1$ native speakers. This is a violation of
fairness which AEQUEVOX aims to detect.
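
As a worked instance of Equation (2) on the numbers in this example, and taking degradation as the transformed-speech distance minus the original-speech distance (the convention of Algorithm 1), the two groups compare as follows:

$$
D_{L_1} = 4 - 2 = 2, \qquad D_{L_2} = 25 - 4 = 21, \qquad |D_{L_1} - D_{L_2}| = 19 .
$$

For any threshold $\tau$ below 19, this pair would be reported as a fairness violation against the $L_2$ speakers.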


The working principle behind AEQUEVOX holds even if the spoken text is 
different. This is because AEQUEVOX just measures the relative degradation in 
ASR performance for a set of speakers. For large datasets, we are able to measure 
the average degradation in ASR performance with respect to different groups of 
speakers (e.g. male, female, native, non-native English speakers). 

Fig. 2: Sound wave transformations

Fig. 3: AEQUEVOX System Overview


Metamorphic Transformations of Sound: The ability to operate in a wide
range of environments is crucial in ASR systems as they are deployed in
safety-critical settings such as medical emergency services [18] and air traffic
management [21], [27], which are known to have interference and noise.
Metamorphic speech transformations serve to simulate such scenarios. The key
insight for our metamorphic transformations comes from how waves are
represented and what can happen to these waves when they are transmitted in
different media. We realise this insight in the fairness testing system for ASR
systems. To the best of our knowledge, AEQUEVOX is the first work that combines
this insight from acoustics, software testing and software fairness to evaluate
the fairness of ASR systems. AEQUEVOX uses the addition of noise (Figure 2 (b)),
amplitude modification (Figure 2 (c)), frequency modification (Figure 2 (d)),
amplitude clipping (Figure 2 (e)), frame drops (Figure 2 (f)), low-pass filters
(Figure 2 (g)), and high-pass filters (Figure 2 (h)) as metamorphic speech
transformations. We choose these transformations because they are the most
common distortions for sound in various environments [1].

System Overview: Algorithm 1 provides an outline of our overall test
generation process. We realise the notion of fairness described in Equation (2)
using differential testing. The error rates ($ASR_{Err}$) for a particular
speech clip are found by finding the difference between the outputs of two ASR
systems, $ASR_1$ and $ASR_2$. It is important to note that we make a design
choice to use differential testing to find the error rate ($ASR_{Err}$). This
helps us eliminate the need for ground truth transcription data, which is both
labor intensive and expensive to obtain. Furthermore, AEQUEVOX realises the
$\delta$ seen in Equation (2) by using metamorphic transformations for speech
(see Figure 3). These speech metamorphic transformations represent the various
simulated environments for which AEQUEVOX wants to measure the quality of
service for different groups. Additionally, the user can customise this $\delta$
per their requirements. In our implementation we use eight distinct metamorphic
transformations as $\delta$ (see Figure 2). Specifically, we investigate how
fairly two ASR systems ($ASR_1$ and $ASR_2$) treat groups
($GR_k \mid k \in \{1, 2, \cdots, n\}$) with respect to a base group ($GR_B$).
AEQUEVOX achieves this by taking a dataset of speech which contains data from
two or more different groups (e.g. male and female speakers, native English and
non-native English speakers) and modifying these speech snippets through a set
of transformations ($MT$). These are then divided into base group transformed
speech ($GR_B^T$) and the transformed speech for other groups
($GR_k^T \mid k \in \{1, 2, \cdots, n\}$). As seen in Algorithm 1, the average
word-level Levenshtein distance (word-level Levenshtein distance divided by the
number of words in the longer transcript) between the outputs of the two ASR
systems is captured by $d_B$ and $d_B^T$ for the original and transformed speech
respectively. Similarly, for the comparison groups $GR_k$
($k \in \{1, 2, \cdots, n\}$) the word-level Levenshtein distance is captured by
$d_k$ and $d_k^T$. The higher the Levenshtein distance, the larger the error in
terms of differential testing. In other words, a larger error in differential
testing means that the ASR systems disagree on a higher number of words.
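
The normalised word-level distance described above could be computed along the following lines. This is a minimal sketch rather than the authors' implementation: the function names are hypothetical, and lowercasing plus whitespace tokenisation are assumptions the text does not specify.

```python
def word_levenshtein(ref_words, hyp_words):
    """Edit distance between two word lists (insert, delete, substitute cost 1)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalised_distance(transcript_a, transcript_b):
    """Word-level distance divided by the word count of the longer transcript."""
    # Assumption: transcripts are compared case-insensitively, split on whitespace.
    words_a = transcript_a.lower().split()
    words_b = transcript_b.lower().split()
    longest = max(len(words_a), len(words_b))
    if longest == 0:
        return 0.0
    return word_levenshtein(words_a, words_b) / longest
```

For two six-word transcripts that differ in a single word, `normalised_distance` returns 1/6; identical transcripts give 0.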


To capture the degradation in the quality of service for the speech subjected
to simulated environments ($MT$), we compute the difference between the
word-level Levenshtein distance for the original and transformed speech.
Specifically, we compute $D_B$ as $d_B^T - d_B$ and $D_k$ as $d_k^T - d_k$
($k \in \{1, 2, \cdots, n\}$) for the base and comparison groups, respectively.
The higher this metric ($D_B$ and $D_k$), the more severe the degradation in ASR
quality of service because of the transformation $T$.


We compare these metrics and if $D_B$ exceeds $D_k$ by some threshold $\tau$, we
classify this as an error for the base group ($GR_B$) and more specifically a
violation of fairness (see Figure 3). In our experiments we set each of the
groups in our dataset as the base group ($GR_B$) and run the AEQUEVOX technique
to find errors with respect to that base group. The lower the errors (as
computed via the violation of the assertion $D_B - D_k \le \tau$), the fairer
the ASR systems are with respect to the base group $GR_B$. As an example, let us
say Russian speakers are the base group ($GR_B$), English speakers are the
comparison group ($GR_k$) and the value of $\tau$ is 0.1. If $D_B$ exceeds $D_k$
by more than 0.1, then a fairness violation is counted for the Russian speakers.
Otherwise, no fairness errors are recorded.
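
The loop of Algorithm 1 can be sketched as follows. This is an illustrative reading, not the authors' code: the ASR systems are passed in as plain callables, a "clip" is whatever those callables accept, and the toy ASRs, the `add_noise` stand-in and the two groups at the bottom are hypothetical.

```python
from statistics import mean


def word_distance(text_a, text_b):
    """Word-level Levenshtein distance, divided by the longer transcript length."""
    a, b = text_a.split(), text_b.split()
    row = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        new_row = [i]
        for j, word_b in enumerate(b, start=1):
            new_row.append(min(row[j] + 1,                        # deletion
                               new_row[j - 1] + 1,                # insertion
                               row[j - 1] + (word_a != word_b)))  # substitution
        row = new_row
    return row[-1] / max(len(a), len(b), 1)


def disagreement(clips, asr_1, asr_2):
    """L in Algorithm 1: mean distance between the two ASR outputs."""
    return mean(word_distance(asr_1(c), asr_2(c)) for c in clips)


def fairness_testing(base, transformations, groups, tau, asr_1, asr_2):
    """Collect (base, group, transformation) triples with D_B - D_k > tau."""
    def degradation(clips, transform):
        original = disagreement(clips, asr_1, asr_2)
        transformed = disagreement([transform(c) for c in clips], asr_1, asr_2)
        return transformed - original

    error_set = []
    for t_name, transform in transformations.items():
        d_base = degradation(base["clips"], transform)
        for group in groups:
            d_group = degradation(group["clips"], transform)
            if d_base - d_group > tau:
                error_set.append((base["name"], group["name"], t_name))
    return error_set


# Toy stand-ins (hypothetical): a clip is a dict, and the second ASR loses
# half of the words when a clip from group B is noisy.
def asr_1(clip):
    return clip["text"]


def asr_2(clip):
    words = clip["text"].split()
    if clip["noisy"] and clip["group"] == "B":
        words = words[: len(words) // 2]
    return " ".join(words)


def add_noise(clip):
    return {**clip, "noisy": True}


group_a = {"name": "A", "clips": [{"text": "one two three four", "group": "A", "noisy": False}]}
group_b = {"name": "B", "clips": [{"text": "one two three four", "group": "B", "noisy": False}]}

print(fairness_testing(group_b, {"noise": add_noise}, [group_a], 0.1, asr_1, asr_2))
print(fairness_testing(group_a, {"noise": add_noise}, [group_b], 0.1, asr_1, asr_2))
```

With group B as the base, the toy setup yields one violation (B degrades by 0.5 while A does not degrade). With group A as the base, nothing is reported, which mirrors the one-sided test $D_B - D_k > \tau$.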


Fault Localisation: AEQUEVOX introduces a word-level fault localisation
technique, which does not require any access to ground truth data. We first
illustrate a use case of this fault localisation technique.

Algorithm 2 AEQUEVOX Fault Localizer
procedure FAULT_LOCALIZER($WC$, $WC^{T_\theta}$, $\omega$, $param^T$)
    $Drop\_Count \leftarrow \emptyset$
    $Non\_Robust\_Words \leftarrow \emptyset$
    for $word \in WC.keys()$ do
        $init\_count \leftarrow WC[word]$
        ▷ Returns the minimum count of word across all the parameters
        ▷ of transformation $T$
        $min\_count \leftarrow get\_min(WC^{T_\theta}[word], param^T)$
        $count\_diff \leftarrow \max((init\_count - min\_count), 0)$
        if $count\_diff > \omega$ then
            $Non\_Robust\_Words \leftarrow Non\_Robust\_Words \cup \{word\}$
        end if
        $Drop\_Count \leftarrow Drop\_Count \cup \{count\_diff\}$
    end for
    return $Non\_Robust\_Words$, $Drop\_Count$
end procedure


Fig. 4: AEQUEVOX Fault Localization Overview


Example: Let us consider a corpus of English sentences by a group of speakers
(say $GR$) who speak language $L_1$ natively. AEQUEVOX builds a dictionary for
all the words in the transcript obtained from $ASR_1$. An excerpt from such a
dictionary appears as follows: {brother : 16, nice : 25, is : 33, $\cdots$}.
This means the words brother, nice and is were seen 16, 25 and 33 times in the
transcript respectively. Now, assume AEQUEVOX simulates a noisy environment by
adding noise with various signal-to-noise ratios (SNR) as follows:
{10, 8, 6, 4, 2}. This is the parameter for the transformation ($param^T$).

Once AEQUEVOX obtains the transcript of these transformed inputs, it creates
dictionaries similar to the ones seen in the preceding paragraph. Let the
relevant subset of the dictionary for SNR two (2) be {brother : 1, nice : 23,
is : 32, $\cdots$}. We use this to determine that the utterance of the word
brother is not robust to noise addition for the group $GR$. This is because the
word brother appears significantly less often in the transcript for the modified
speech than in the transcript for the original speech.


AEQUEVOX fault localisation overview: Algorithm 2 provides an overview
of the fault localisation technique implemented in AEQUEVOX. The goal of the
AEQUEVOX fault localisation is to find words for a group ($GR$) that are not
robust to the simulated environments. Specifically, AEQUEVOX finds words which
are not recognised by the ASR when subjected to the appropriate speech
transformations.

The transformation is represented by $T_\theta$. Here, $T \in MT$ is the
transformation and $\theta \in param^T$ is the parameter of the transformation,
which controls the severity of the transformation.

As seen in Algorithm 2, AEQUEVOX builds word count dictionaries $WC$ and
$WC^{T_\theta}$ for the original speech and for each $\theta \in param^T$
respectively. For each word, AEQUEVOX finds the difference in the number of
appearances of the word in $WC$ and in $WC^{T_\theta}$ for $\theta \in param^T$.
To compute the difference, we locate the minimum number of appearances across
all the transformation parameters $\theta \in param^T$ (i.e. min_count in
Algorithm 2). This is to locate the worst-case degradation across all
transformation parameters. The difference is then calculated between min_count
and the number of appearances of the word in the original speech (i.e.
init_count). If the difference exceeds some user-defined threshold $\omega$,
then AEQUEVOX classifies the respective words as non-robust w.r.t. the group
$GR$ and transformation $T$.
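
A minimal sketch of Algorithm 2 follows, under some assumptions: word counts are held in dictionaries keyed by word, the per-parameter dictionaries are keyed by $\theta$, and `Drop_Count` is returned as a word-to-drop mapping rather than a bare set so that each drop stays attached to its word. The helper `word_counts` and the SNR-10 counts in the usage lines are hypothetical; the SNR-2 counts are the ones from the example above.

```python
from collections import Counter


def word_counts(transcripts):
    """Count how often each word appears across a list of transcripts."""
    return Counter(word for text in transcripts for word in text.lower().split())


def fault_localizer(wc, wc_by_theta, omega):
    """Return (non-robust words, per-word worst-case drop in count).

    wc:          word -> count in the original transcripts
    wc_by_theta: theta -> (word -> count) for each transformation parameter
    omega:       user-defined threshold on the count difference
    """
    non_robust_words = set()
    drop_count = {}
    for word, init_count in wc.items():
        # Worst case: the fewest appearances over all transformation parameters.
        min_count = min((counts.get(word, 0) for counts in wc_by_theta.values()),
                        default=init_count)
        count_diff = max(init_count - min_count, 0)
        if count_diff > omega:
            non_robust_words.add(word)
        drop_count[word] = count_diff
    return non_robust_words, drop_count


original = {"brother": 16, "nice": 25, "is": 33}
transformed = {
    10: {"brother": 16, "nice": 25, "is": 33},  # hypothetical counts for SNR 10
    2: {"brother": 1, "nice": 23, "is": 32},    # counts from the example (SNR 2)
}
non_robust, drops = fault_localizer(original, transformed, omega=3)
print(non_robust, drops)
```

With $\omega = 3$, only brother is flagged: its count falls by 15, while nice and is fall by 2 and 1.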

We envision that practitioners can then review the data generated by fault
localisation (i.e. Algorithm 2) and target the non-robust words to further
improve their ASR systems for speech from underrepresented groups [24] and
accommodate speech variability [22]. In RQ3, we validate our fault localisation
method empirically and in RQ4, we show how the proposed fault localisation
method can be used to highlight fairness violations.


4 Datasets and Experimental Setup 


ASR Systems under Test: We evaluate AEQUEVOX on three commercial ASR 
systems from Google Cloud Platform (GCP), IBM Cloud, and Microsoft Azure. 
We use the standard models for GCP and Azure, and the BroadbandModel for 
IBM. In all three cases, the audio samples were identically encoded as .wav files 
using Linear 16 encoding. 

In each of the following transformations, we vary a parameter, $\theta$. We call
this the transformation parameter. Some of the transformations have
abbreviations within parentheses. Such abbreviations are used in later sections
to refer to the respective transformations.


Amplitude Scaling (Amp): For amplitude scaling, we scale the audio sequence
by a constant by multiplying each individual audio sample by $\theta$.


Clipping: The audio samples are scaled such that their amplitude values are
bound by $[-1, 1]$. AEQUEVOX then clips these samples such that the amplitude
range is $[-\theta, \theta]$. These clipped samples are then rescaled and
encoded.

Drop/Frame: For Drop, AEQUEVOX divides the audio into 20ms chunks. $\theta$%
of these chunks are then randomly discarded (amplitude set to zero) from the
audio. For Frame, AEQUEVOX divides the audio into $\theta$ ms chunks and 10% of
these chunks are then randomly discarded. No two adjacent chunks are discarded.

High Pass (HP)/ Low Pass (LP) Filter: Here we apply a Butterworth [7]
filter of order two to the entire audio file with $\theta$ determining the
cut-off frequency.

Noise Addition (Noise): $\theta$ represents the signal-to-noise ratio (SNR) [25]
of the transformed audio signal. A lower $\theta$ means higher noise in the
transformed audio.

Frequency Scaling (Scale): In this case, $\theta$ is the sampling frequency. The
lower the value of $\theta$, the slower the audio. In this transformation, the
audio is slowed down $\theta$ times.

Table 2: Transformations Used

Transformation Type    $\theta$ Used (Least Destructive → Most Destructive)
Amplitude              0.5     0.4     0.3     0.2     0.1
Clipping               0.05    0.04    0.03    0.02    0.01
Drop                   5       10      15      20      25
Frame                  10      20      30      40      50
HP                     500     600     700     800     900
LP                     900     800     700     600     500
Noise                  10      8       6       4       2
Scale                  0.9     0.8     0.7     0.6     0.5

Table 3: Datasets Used

Dataset             Duration (s)    #Clips    #Distinct Speakers
Accents             25-35           28        28
RAVDESS             3               32        8
Midlands            3-5             4         4
Nigerian English    4-6             4         4

Table 2 lists all the different values used for $\theta$. An additional
parameter ($\theta = 2.0$) is used for Amp.
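
Several of these transformations can be sketched on a plain list of float samples. This is an illustrative sketch under stated assumptions, not the tool's implementation: it assumes samples are floats, that "rescaled" in Clipping means stretching back to $[-1, 1]$, that the SNR is expressed in decibels, and that the added noise is white Gaussian. All function names are hypothetical, and the filters and frequency scaling are omitted.

```python
import math
import random


def amplitude_scale(samples, theta):
    """Amp: multiply every sample by theta."""
    return [s * theta for s in samples]


def clip(samples, theta):
    """Clipping: normalise to [-1, 1], clip to [-theta, theta], rescale."""
    peak = max((abs(s) for s in samples), default=0.0) or 1.0
    clipped = [max(-theta, min(theta, s / peak)) for s in samples]
    return [s / theta for s in clipped]  # assumption: rescale back to [-1, 1]


def drop_chunks(samples, sample_rate, chunk_ms, percent, rng):
    """Drop/Frame: zero out percent% of chunks, never two adjacent ones."""
    chunk_len = max(1, int(sample_rate * chunk_ms / 1000))
    n_chunks = math.ceil(len(samples) / chunk_len)
    target = int(n_chunks * percent / 100)
    candidates = list(range(n_chunks))
    rng.shuffle(candidates)
    dropped = set()
    for idx in candidates:  # greedy pick; may fall short for very high percent
        if len(dropped) == target:
            break
        if idx - 1 not in dropped and idx + 1 not in dropped:
            dropped.add(idx)
    out = list(samples)
    for idx in dropped:
        start = idx * chunk_len
        for i in range(start, min(start + chunk_len, len(out))):
            out[i] = 0.0
    return out


def add_noise(samples, snr_db, rng):
    """Noise: add white Gaussian noise at the requested SNR (assumed in dB)."""
    if not samples:
        return []
    signal_power = sum(s * s for s in samples) / len(samples)
    sigma = math.sqrt(signal_power / (10 ** (snr_db / 10)))
    return [s + rng.gauss(0.0, sigma) for s in samples]


# One second of a 440 Hz tone at 16 kHz as a stand-in for speech.
rate = 16000
tone = [math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
rng = random.Random(0)
drop_10 = drop_chunks(tone, rate, chunk_ms=20, percent=10, rng=rng)   # Drop, theta = 10
frame_20 = drop_chunks(tone, rate, chunk_ms=20, percent=10, rng=rng)  # Frame, theta = 20
noisy = add_noise(tone, snr_db=2, rng=rng)                            # Noise, theta = 2
```

Under this reading, Drop and Frame share one routine: Drop fixes the chunk length at 20 ms and varies the percentage, while Frame fixes the percentage at 10 and varies the chunk length.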

Datasets: We use the Speech Accent Archive (Accents) [54], the Ryerson
Audio-Visual Database of Emotional Speech and Song (RAVDESS) [30], Multi speaker
Corpora of the English Accents in the British Isles (Midlands) [11], and a
Nigerian English speech dataset [2] to evaluate AEQUEVOX, taking care to ensure
male and female speakers are equally represented. Table 3 provides additional
details about the setup.


5 Results 


In this section, we discuss our evaluation of AEQUEVOX in detail. In particular, 
we structure our evaluation in the form of four research questions (RQ1 to 
RQ4). The analysis of these research questions appears in the following sections. 


RQ1: What is AEQUEVOX’s efficacy? 

We structure the analysis of this research question into three sections, each
corresponding to a dataset we have used in our analysis. All of the relevant data
is presented in Table 4 with the lowest errors for each dataset bolded. We first
analyse the number of errors (used interchangeably with fairness violations) for
each case. Subsequently, we analyse the sensitivity of the errors with respect
to the values of $\tau$ ($\tau \in \{0.01, 0.05, 0.1, 0.15\}$). Detecting
violations of fairness is regulated by parameter $\tau$. Lower values of $\tau$
imply that the degradation of word error rates between two groups should be
similar, and conversely higher values of $\tau$ allow for the difference in
degradation of word error rates to be more severe between two groups. Next, we
analyse the sensitivity of the pairs of the ASR systems under test. Concretely,
we analyse the errors found in the Microsoft Azure and IBM Watson (MS_IBM),
Google Cloud and IBM Watson (IBM_GCP), and Microsoft Azure and Google Cloud
(MS_GCP) pairs. Finally, we analyse the sensitivity of the AEQUEVOX test
generation with respect to the eight different types of transformations
implemented (see Figure 2).

Table 4: Errors Discovered by AEQUEVOX

                Accents (English to Russian); RAVDESS (Male, Female); Nigerian/Midlands English (Midlands, Nigerian)
                    English      Ganda     French   Gujarati Indonesian     Korean    Russian       Male     Female   Midlands   Nigerian
$\tau$ Sensitivity
0.01                    168        381        267        232        178        499        354         12          2         36         75
0.05                     75        245         99        101         85        340        227          8         53         26         65
0.10                     43        145         39         49         34        172        161          5          -         17         55
0.15                     26         73          8         24         14         75        111          3         10         14         44
ASR Sensitivity
MS_IBM                   36        369        128        126         64        388        303         10         57         30         86
GCP_IBM                 131        325        123        147         98        342        361          9         64         31         96
MS_GCP                  145        150        162        133        149        356        189          9         55         32         57
Transformation Sensitivity
Clipping                  4         81         38        159         72        182        237          0         24         50          3
Drop                      8        113         33         29         40        184         45          0          2          4         33
Frame                    14        106         61         25         36        170         26          1         13         13         19
Noise                     5        128         54         86         22        217        213          0         24          5         43
LP                       39        158        108         57         14        110        208          0         45          4         34
Amplitude                81         19         44         33         14         40         26          0         27          8         40
HP                      114        168         29          9         61         87         57          9         20          1         51
Scale                    47         71         46          8         52         96         41         18          2          8         16
Total Errors            312        844        413        406        311       1086        853         28        176         93        239


It is important to note that we excluded the two most destructive Scale
transformations. This is because the word error rate for these transformations
is 0.89 on average out of 1. This degradation may be attributed to the
transformation itself rather than the ASR. To avoid such cases, we exclude these
transformations from this research question.


Accents Dataset: Native English speakers and Indonesian speakers have the
lowest number of errors. On average, speech from non-native English speakers
generates 109% more errors in comparison to speech from native English
speakers. For the two smallest values of $\tau$, speech from the native English
speakers shows the least number of fairness violations. Speech from native
English speakers has the lowest, second lowest and third lowest errors for the
pairs of ASRs (MS_IBM), (MS_GCP) and (IBM_GCP) respectively. Speech from native
English speakers has the lowest errors for the clipping, two types of frame
drops and noise transformations and the second lowest errors for the low-pass
filter transformation. The high-pass filter and scaling induce a comparable
number of errors from native and a majority of the non-native English speakers.
However, speech from native English speakers has the highest number of errors
when subject to the amplitude transformation.


Speech from non-native English speakers generally exhibits more fairness 
violations in comparison to speech from native English speakers. 


RAVDESS Dataset: Speech from male speakers has significantly lower errors
than speech from female speakers. On average, speech from female speakers
generates 528.57% more errors in comparison to speech from male speakers.
Speech from male speakers shows significantly fewer fairness violations for all
values of $\tau$, and for all ASR pairs tested. Clipping, both types of frame
drops, noise, low-pass, amplitude and high-pass transformations induce
significantly fewer errors on speech from male speakers. However, speech from
male speakers has more errors when subject to scale transformations.


Speech from female speakers has significantly more fairness violations in 
comparison to speech from male speakers. 


Midlands/Nigeria Dataset: Speech from UK Midlands English (ME) speakers has
significantly fewer errors than speech from Nigerian English (NE) speakers. On
average, speech from NE speakers generates 156.9% more errors in comparison to
speech from ME speakers. Speech from ME speakers has significantly fewer
fairness errors for all values of $\tau$, and for all ASR pairs tested. For the
transformations scale, drop, noise, amplitude, low pass and high pass filters,
the speech from ME speakers has significantly fewer errors than speech from NE
speakers. Clipping induces more errors in speech from ME speakers, while the
frame transformation induces a comparable number of errors in speech from both
groups.


Speech from Nigerian English speakers has significantly more fairness errors 
in comparison to speech from UK Midlands speakers. 
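
The percentages quoted in this research question can be re-derived from the Total Errors row of Table 4. The snippet below is a small arithmetic check; the helper name `percent_more` is hypothetical, and for the Accents dataset it assumes the comparison is between the native English total and the mean of the six non-native totals.

```python
# Total Errors row of Table 4.
totals = {
    "Accents": {"English": 312, "Ganda": 844, "French": 413, "Gujarati": 406,
                "Indonesian": 311, "Korean": 1086, "Russian": 853},
    "RAVDESS": {"Male": 28, "Female": 176},
    "Midlands/Nigerian": {"Midlands": 93, "Nigerian": 239},
}


def percent_more(reference, others):
    """Relative increase of the mean of `others` over `reference`, in percent."""
    return (sum(others) / len(others) - reference) / reference * 100


accents = totals["Accents"]
non_native = [count for group, count in accents.items() if group != "English"]

accents_gap = percent_more(accents["English"], non_native)
ravdess_gap = percent_more(totals["RAVDESS"]["Male"], [totals["RAVDESS"]["Female"]])
midlands_gap = percent_more(totals["Midlands/Nigerian"]["Midlands"],
                            [totals["Midlands/Nigerian"]["Nigerian"]])

print(f"{accents_gap:.2f} {ravdess_gap:.2f} {midlands_gap:.2f}")
```

This gives roughly 109.03, 528.57 and 156.99, in line with the 109%, 528.57% and 156.9% reported in the text (the last figure appears to be truncated rather than rounded).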


RQ2: What are the effects of transformations on comprehensibility? 

To better understand the effects of the transformations (see Figure 2) on 
the comprehensibility of the speech we conducted a user study. Speech of one 
randomly chosen female native English speaker from the Accents [54] dataset 
was used since the audio contains nearly all the sounds present in the English 
language [54]. Survey participants were presented with the original audio file 
along with a set of transformed speech files in order of increasing intensity. All 
the transformations (see Figure 2) and transformation parameters (see Table 2) 
were used. We asked 200 survey participants (sourced through Amazon mTurk) 
the following question: 


How comprehensible is (transformed) Speech with respect 
to the Original speech? 


The rating of one (1) is Not Comprehensible at all and the rating of ten (10) is
Just as Comprehensible as the Original.

Unsurprisingly, as seen in Figure 5, increasing the intensities of the
transformation had a generally detrimental effect on the comprehensibility of
the speech. But none of the transformations majorly affect the comprehensibility
of the speech. All of the transformations had an average comprehensibility
rating above 6.75 and 82.9% of the transformations had a comprehensibility
rating above 7.

Fig. 5: Average Transformation Comprehensibility Ratings (x-axis: Least
Destructive Transformation to Most Destructive Transformation)

Table 5: Fairness errors where the transformations have a comprehensibility
rating of at least 7.2

                Accents (English to Russian); RAVDESS (Male, Female); Nigerian/Midlands English (Midlands, Nigerian)
                    English      Ganda     French   Gujarati Indonesian     Korean    Russian       Male     Female   Midlands   Nigerian
Total Errors            246        509        240        166        225        687        329         28         88         55        161


Table 6: Grammar-generated sentence examples

ASR           Microsoft                         Google Cloud                      IBM Watson
Robust        Ashley likes fresh smoothies      Karen loves plastic straws        William detests plastic cups
              Paul adores spoons of cinnamon    Donald hates big decisions        Steven detests big flags
Non-robust    Ashley detests thick smoothies    John loves spoons of cinnamon     Betty likes scoops of ice cream
              Ryan likes slabs of cake          Robert loves bags of concrete     Amanda is fond of things like groceries


The average degradation in comprehensibility for the least destructive
parameter across all transformations was 24.36%. Noise was the most destructive
at 27.75% and drop was the least destructive (20.96%).

The average degradation in comprehensibility for the most destructive
parameter across all transformations was 29.18%. In this case, scaling was the
most destructive at 32.23% whereas drop was the least destructive with 25.88%.

Additionally, for each transformation, we analyse the percentage drop of
comprehensibility between the least and the most destructive transformation
parameters. The average drop is 4.82% across all transformations. The scaling
and drop transformations show high relative percentage drops of 10.05% and 8.32%
respectively. Amplitude, clipping, noise, high-pass and low-pass filters show
closer to average drops between 3.1% and 4.5%. Frame, on the other hand, shows
a very low relative drop of 0.76%.


All the transformations, though destructive, are comprehensible by humans. 


For safety-critical applications, we recommend that future work test the
whole gamut of transformations. For other use cases, practitioners may choose
the transformations that satisfy their needs. To aid this, AEQUEVOX allows the
users to choose the comprehensibility threshold of the transformations. As seen
in Table 5, our conclusion holds even if we choose the transformations with a
higher comprehensibility threshold (7.2). We highlight the group with the least
errors in each dataset to aid in readability. In particular, we observe that
speech from male and UK Midlands speakers generally exhibits fewer errors.
Setting aside speech from native Gujarati speakers, speech from native English
speakers exhibits comparable or better performance than speech from other
groups.


RQ3: Are the outputs produced by AEQUEVOX fault localiser valid? 

To study the validity of the outputs of the fault localiser, we study the
number of errors for the predicted robust and non-robust words. We do this
by generating speech containing the predicted robust and non-robust words for
each ASR tested. We choose an $\omega$ of three, three and two for GCP, MS Azure
and IBM respectively to choose the non-robust words (see Algorithm 2). We
choose the robust words from the set of words that do not show any errors
in the presence of noise (count_diff = 0 in Algorithm 2) for these specific ASR
systems. Specifically, we test whether the robust and non-robust words identified
by the fault localiser in the Accents dataset are robust in the presence of noise.
Our goal is to show that if noise is added to speech containing these non-robust
words, the ASR will be less likely to recognise them. Conversely, if noise is
added to speech containing the predicted robust words, they are less likely to
be affected.

To generate the speech from the output we generate sentences containing the 
robust and non-robust words predicted by the fault localiser for each ASR using 
a grammar and then use a text-to-speech (TTS) service to generate speech. 
The actual randomly selected robust and non-robust words (in bold) and the 
examples of the sentences generated by the grammar can be seen in Table 6. 
We use the Google TTS for MS Azure and we use the Microsoft Azure TTS for 
GCP and IBM to generate the speech. 
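
The grammar-based sentence generation step might look like the sketch below. The grammar itself is hypothetical: the source does not give the handcrafted grammars, so the productions here are only modelled on the sentences in Table 6, and the target word stands in for a predicted robust or non-robust word. The TTS call is left out.

```python
import random

# Hypothetical grammar, loosely modelled on the sentences in Table 6.
GRAMMAR = {
    "S": [["NAME", "VERB", "OBJECT"]],
    "NAME": [["Ashley"], ["Paul"], ["Karen"], ["Donald"]],
    "VERB": [["likes"], ["loves"], ["adores"], ["hates"], ["detests"]],
    "OBJECT": [["ADJ", "TARGET"], ["spoons", "of", "TARGET"]],
    "ADJ": [["fresh"], ["big"], ["plastic"], ["thick"]],
}


def expand(symbol, target, rng):
    """Recursively expand a grammar symbol into a list of words."""
    if symbol == "TARGET":
        return [target]          # slot for the word under test
    if symbol not in GRAMMAR:
        return [symbol]          # terminal word
    production = rng.choice(GRAMMAR[symbol])
    words = []
    for part in production:
        words.extend(expand(part, target, rng))
    return words


def generate_sentences(target, n, seed=0):
    """Generate n sentences that each contain the target word exactly once."""
    rng = random.Random(seed)
    return [" ".join(expand("S", target, rng)) for _ in range(n)]


sentences = generate_sentences("smoothies", 50)
```

Each generated sentence carries exactly one word under test, which matches the setup of 50 sentences per robust and non-robust case.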

To evaluate the generality of outputs of the fault localisation technique, we
use the speech produced by the TTS and then add noise to that speech. This
speech is used to generate a transcript from the ASR and the transcript is used to
evaluate how many of the predicted robust and non-robust words are incorrect
in the transcript. We add the most noise possible to the TTS speech in our
AEQUEVOX framework. Specifically, the signal-to-noise ratio (SNR) is 2. We use
the TTS generated speech for 50 sentences for each of the robust and non-robust
cases. Each sentence has either a robust or a non-robust word.

The results of the experiments are seen in Table 7. In the transcript of the
speech with noise added at SNR 2, the predicted robust words show zero errors
for Microsoft and Google Cloud and 21 errors for IBM. The non-robust words, on
the other hand, had 23, 15 and 30 errors, respectively.


The predicted non-robust words have a higher propensity for errors than the 
robust words. 


Table 7: Transcript Errors

ASR                   Words         Errors
Microsoft (MS)        Robust        0
                      Non-Robust    23
Google Cloud (GCP)    Robust        0
                      Non-Robust    15
IBM Watson (IBM)      Robust        21
                      Non-Robust    30

Table 8: Grammarly Scores

ASR                   Words         Overall Score    Correctness     Clarity
Microsoft (MS)        Robust        99               Looking Good    Very Clear
                      Non-Robust    99               Looking Good    Very Clear
Google Cloud (GCP)    Robust        100              Looking Good    Very Clear
                      Non-Robust    99               Looking Good    Very Clear
IBM Watson (IBM)      Robust        100              Looking Good    Very Clear
                      Non-Robust    96               Looking Good    Very Clear


Note on grammar validity: Since the grammars we use to validate the
explanations of AEQUEVOX are handcrafted, they may be prone to errors. To
verify these handcrafted grammars, we use 100 sentences produced by each
grammar and use the online tool Grammarly [3] to investigate the semantic and
syntactic correctness of the sentences and their clarity. The sentences
generated by the grammars have a high overall average score of 98.33 out of 100,
with the lowest being 96 (see Table 8). On the correctness and clarity measure,
all the sentences generated by the grammars score Looking Good and Very Clear.

Table 9: Average word mispredictions in the Accents dataset using the AEQUEVOX
localisation techniques

                    English      Ganda     French   Gujarati Indonesian     Korean    Russian
ASR Sensitivity
GCP                    1.21       1.51       1.21       1.17       1.07       1.55       1.64
IBM                    1.03       1.94       1.38       1.35       1.48       1.92       1.70
MS Azure               0.47       0.66       0.40       0.48       0.36       0.87       0.63
Transformation Sensitivity
Clipping               2.00       2.53       2.12       2.60       2.29       2.81       3.13
Drop                   0.30       1.02       0.52       0.54       0.57       1.15       0.74
Frame                  0.38       0.89       0.68       0.56       0.51       1.19       0.65
Noise                  0.57       1.60       0.85       1.27       0.71       1.74       1.54
LP                     1.72       2.22       1.90       1.79       1.58       1.98       2.13
Amplitude              0.17       0.15       0.11       0.12       0.06       0.20       0.16
HP                     0.74       0.75       0.38       0.22       0.49       0.64       0.76
Scale                  1.38       1.79       1.42       0.90       1.54       1.89       1.45


RQ4: Can the fault localiser be used to highlight unfairness?


The goal of this RQ is to investigate if the output of Algorithm 2 can call
attention to bias between different groups. Specifically, we evaluate if some
groups show fewer faults, on average, than others. To this end, we use the fault
localisation algorithm (Algorithm 2) on the Accents dataset and record the
average number of incorrect words in the transcript for each group of the
Accents dataset. This is done for each ASR under test. It is also important to
note that this technique uses no ground truth data and requires no manual input.
This technique is designed to work with just the speech data and metadata
(groups).


Table 9 shows the average word drops across all transformations for the
Accents dataset for each ASR under test. We highlight the best performing groups
by bolding the values. Speech from native English speakers shows the lowest
average word drops for the IBM Watson ASR and the third lowest for GCP
and MS Azure ASRs. We also investigate the average word drops for each
transformation in AEQUEVOX averaged across all ASRs. Speech from native English
speakers has the lowest average word drops for the clipping, two types of frame
drops and noise transformations and the second lowest errors for the low-pass
filter transformation (see Table 9). For the rest of the transformations, namely
amplitude, high-pass filter and scaling, we find that both speech from non-native
English speakers and speech from native English speakers have comparable
average word drops in the majority of cases (see Table 9). This result is
consistent with results seen in RQ1.


The technique seen in Algorithm 2 can be used to highlight bias in speech and 
the results are consistent with RQ1. 

6 Threats to Validity


User Study: In conducting the study, two assumptions were made. Firstly, 
we assume that the degree to which comprehensibility changes when subject 
to transformations is independent of the characteristics of the speaker’s voice. 
Secondly, we assume that the speech is reflective of the broader English language. 
In future work, a larger scale user study could be performed to verify the results. 


ASR Baseline Accuracy: AEQUEVOX measures the degradation of the speech 
to characterise the unfairness amongst groups and ASR systems. If the baseline 
error rate is very high, then the room for further degradation is very low. As a 
result, AEQUEVOX expects ASR services to have a high baseline accuracy. To 
mitigate this threat, we use state-of-the-art commercial ASR systems which have 
high baseline accuracies. 


Completeness and Speech Data: AEQUEVOX is incomplete, by design, in
the discovery of fairness violations. AEQUEVOX is limited by the speech data
and the groups of this speech data used to test these ASR systems. With new
data and new groups, it is possible to discover more fairness violations. The
practitioners need to provide data to discover these. In our view, this is a valid
assumption because the developers of these systems have a large (and growing)
corpus of such speech data. It is also important to note that AEQUEVOX does
not need the ground truth transcripts for this speech data and such speech data
is easier to obtain.


Fault Localisation: To test AEQUEVOX's fault localisation, we identify the
robust and non-robust words in the speech and subsequently construct sentences
(with the aid of a grammar). These sentences are then converted to speech using
text-to-speech (TTS) software and the performance of the robust and non-robust
words is measured. In the future, we would like to repeat the same experiment
with a fixed set of speakers, which allows us to capture the peculiarities of
speech in contrast to the usage of TTS software.


7 Related Work 


In the past few years, there has been significant attention in testing ML systems 
[35,48,32,47,55,34,50,40,56,16,52,8,41,19]. Some of these works target coverage- 
based testing [48,55,34,32] or leverage property driven testing [41], while others 
focus on effective testing in targeted domains e.g. text [50,40]. None of these 
works, however, are directly applicable for testing ASR systems. In contrast, the 
goal of AEQUEVOX is to automatically discover violations of fairness in ASR 
systems without access to ground truth data. 

DeepCruiser [13] uses metamorphic transformations and performs coverage- 
guided fuzzing to discover transcription errors in ASR systems. Concurrently, 
CrossASR [5] uses text to generate speech from a TTS engine and subsequently 
employs differential testing to find bugs in the ASR system. In contrast to these 
systems, the goal of AEQUEVOX is to automatically find violations of fairness 
by measuring the degradation of transcription quality from the ASR when the 
speech is transformed. AEQUEVOX compares this degradation across various 
groups of speakers and if the difference is substantial, AEQUEVOX characterises 
this as a fairness violation. Moreover, AEQUEVOX neither requires access to 
manually labelled speech data nor does it require any white/grey box access 
to the ASR model. Works on audio adversarial testing [23,10,9,37,28] 
aim to find imperceptible perturbations that are specially crafted for an 
audio file. In contrast, AEQUEVOX aims to find fairness violations. Additionally, 
AEQUEVOX proposes automatic fault localisation for ASR systems without 
using a ground truth transcript. 

Unlike AEQUEVOX, recent works on fairness testing have focused on credit 
rating [17,49,4,57,42,44,43,41], computer vision [12,6] or NLP systems [33,45]. In 
the systems that deal with such data, it is possible to isolate certain sensitive at- 
tributes (gender, age, nationality) and test for fairness based on these attributes. 
It is challenging to isolate such sensitive attributes in speech data, which calls 
for a separate fairness testing framework specifically for speech data. 

Frameworks such as LIME [38], SHAP [31], Anchor [39] and DeepCover [46] 
attempt to reason why a model generates a specific output for a specific input. In 
contrast to this, AEQUEVOX’s fault localisation algorithm identifies utterances 
spoken by a group that are likely not to be recognised by ASR systems in the 
presence of destructive interference (such as noise). Recent fault localisation 
approaches aim to highlight either the neurons [15] or the training code [53] that 
are responsible for a fault during inference. In contrast, AEQUEVOX highlights 
words that are likely to be transcribed wrongly without having any access to the 
ground truth transcription and with only blackbox access to the ASR system. 


8 Conclusion 


In this work, we introduce AEQUEVOX, an automated fairness testing technique 
for ASR systems. To the best of our knowledge, this is the first work that 
explores considerations beyond error rates for discovering fairness violations. 
We also show, through a user study, that the speech transformations used by 
AEQUEVOX are largely comprehensible. Additionally, AEQUEVOX highlights words 
where a given ASR system exhibits faults, and we show the validity of these 
explanations. These faults can also be used to identify unfairness in ASR systems. 

AEQUEVOX is evaluated on three ASR systems using four distinct 
datasets. Our experiments reveal that speech from non-native English, female 
and Nigerian English speakers exhibits more errors, on average, than speech from 
native English, male and UK Midlands speakers, respectively. We also validate 
the fault localization embodied in AEQUEVOX by showing that the predicted 
non-robust words exhibit 223.8% more errors than the predicted robust words 
across all ASRs. 

We hope that AEQUEVOX drives further work on systematic fairness testing 
of ASR systems. To aid future work, we make all our code and data publicly 
available at github.com/sparkssss/AequeVox and 10.5281/zenodo.5897347. 


Abstract. Large distributed systems with an emphasis on adaptability 
are now considered a necessity in many domains, yet reconfiguration of 
these systems is still largely carried out in an ad hoc fashion, a process 
that is both inefficient and error-prone. In this paper, we tackle the 
planning problem for the reconfiguration of distributed systems in 
the component-based reconfiguration model Concerto. Specifically, given 
some tasks to execute and a desired final state of the system, we show how 
to compute a reconfiguration plan that guarantees satisfaction of inter- 
component dependencies and is also optimized for parallel execution. Our 
technique relies on an SMT solver to compute the required dependencies 
between components and ultimately schedule the reconfiguration. We 
illustrate the use of this technique on a variety of synthetic examples as 
well as a real use case in the context of an OpenStack system. 


Keywords: reconfiguration, planning, synthesis, component models, dis- 
tributed systems 


1 Introduction 


Large distributed software systems are now ubiquitous, with component-based 
systems (e.g., service-oriented architectures or microservices) offering a conve- 
nient way to structure large applications. Indeed, isolating functionalities in com- 
ponents and building systems through composition greatly enhances adaptabil- 
ity and scalability of applications, two important requirements for many orga- 
nizations. This approach is also promoted by the massive adoption of highly- 
distributed computing infrastructures such as cloud and edge computing. 
However, the advantages of distributed architectures come at the price of 
increased complexity and technical challenges related to observability, coordina- 
tion, maintenance, etc. Notably, the system reconfigurations that are required 
to achieve adaptability commonly lead to faults. For example, a study of 597 
unplanned outages that affected popular cloud services between 2009 and 2015 
found that 16% of them were caused by a software or hardware upgrade [16]. The 
study concludes that “the complexity of cloud hardware and software ecosystem 
has outpaced existing testing, debugging, and verification tools”. Indeed, testing 
and debugging methods are largely inadequate in the context of distributed systems, 
while the adoption of more suitable formal methods remains marginal in 
industry. The latter can be attributed to the difficulty of using formal methods 
and tools. Yet formal methods can lighten the burden of program developers and 
system administrators instead of adding to it, with synthesis techniques used to 
generate correct-by-construction programs. In that spirit, we propose to em- 
ploy a Satisfiability Modulo Theories (SMT) solver to automate the planning of 
reconfigurations (deployment, migrations, software updates, etc.) of component- 
based systems, i.e., to generate programs that coordinate the non-functional 
operations required to perform such reconfigurations. There have been some 
attempts to synthesize reconfiguration programs for component-based systems 
(some of them relying on an SMT solver), but they either target ad hoc, non- 
executable models [20], or are limited to specific cases such as deployment [22], 
where the problem of executing parallel tasks is reduced to finding a precedence 
order. In contrast, our work targets the full scope of the component-based re- 
configuration model Concerto [9], which provides a formally-defined execution 
model with expressive constraints on parallelism, as well as a concrete execution 
engine, making it suitable for formal analysis and experimental work. 


In Concerto, reconfigurations are driven by asynchronous behavior requests 
to components. The execution of a behavior may depend on the state of other 
components: such dependencies are denoted by ports that form the interface of 
components, indicating their provisions and requirements towards each other. 
Section 2 gives an overview of Concerto; for a more complete presentation, the 
reader can refer to [9]. Our goal with this work is to automatically generate 
reconfiguration scripts for systems of Concerto components, i.e., determine re- 
quired behaviors and coordinate their execution. We take as starting point a 
reconfiguration goal composed of behaviors to execute over some components 
and a specification of the final state of the system, particularly the statuses of 
ports. That goal may be provided by a system administrator, or could have been 
generated in the context of an autonomic control loop [19]. Importantly, it is a 
partial specification that typically only mentions parts of the system. For exam- 
ple, an administrator may specify only that a certain utility component should 
execute a behavior to update its software, whereas the completion of this task 
actually requires other components to suspend and later resume their activity. 


Since a reconfiguration goal can require changes in any component of a sys- 
tem, the search space for reconfiguration scripts grows rapidly with the number 
of components. To synthesize reconfigurations for large systems, we propose a 
novel technique that takes advantage of the nature of component-based mod- 
els. It first solves the problem for each component individually, by considering 
the internals of the component to find relevant behaviors, under the simplifying 
assumption that external requirements are all satisfied. Later the method coor- 
dinates behaviors over the whole system, relying on a first-order encoding of the 
scheduling problem and making use of the model-finding capabilities of an SMT 
solver. If this step fails due to unsatisfied dependencies, individual component 
reconfiguration goals are refined and the process iterated. Section 3 describes this 
method, and Section 4 measures its performance and scalability on a variety of 
synthetic examples, and illustrates its applicability on a real use case. 


2 Reconfiguration With Concerto 


Components and Assemblies. A distributed system in Concerto is represented as 
an assembly, i.e., a collection of components that correspond to control entities 
for the elements of the system. Components are not intended to represent the 
functional aspects of those elements, but instead to pilot the actions (installation, 
maintenance, suspension of service, etc.) required to operate them during their 
lifespan. In other words, a Concerto component is a wrapper around a new or 
legacy piece of software (e.g., service, module), typically written by its developer, 
that acts as replacement for scripts to install and maintain it. 

The structural interface of a component is provided by its provide ports and 
use ports. Provide ports denote services or data provided by that component 
when those ports are active, while use ports denote requirements that the com- 
ponent has when those ports are active. Ports can be connected in an assembly 
to allow the satisfaction of component requirements. Connected ports impose 
synchronization rules between their components: a use port cannot be activated 
unless connected to an active provide port (the user component may have to 
wait for that requirement to be fulfilled in order to continue its internal activity) 
and a provide port cannot be deactivated while connected to an active use port. 
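
The two synchronization rules above can be read as guard conditions on port status changes. The following is a minimal sketch, not part of Concerto itself: the function names, the string encoding of ports, and the pairing of server's use ports with the service ports of dep1 and dep2 are illustrative assumptions.

```python
# Connections map each use port to the provide port it is connected to.
# The port names and the pairing below are assumptions made for illustration.
CONNECTIONS = {
    "server.service1": "dep1.service",
    "server.service2": "dep2.service",
}


def can_activate_use(use_port, connections, active_ports):
    """A use port may be activated only if connected to an active provide port."""
    provide_port = connections.get(use_port)
    return provide_port is not None and provide_port in active_ports


def can_deactivate_provide(provide_port, connections, active_ports):
    """A provide port may be deactivated only if no connected use port is active."""
    return not any(
        target == provide_port and use_port in active_ports
        for use_port, target in connections.items()
    )


# Hypothetical situation: only dep1.service and server.service1 are active.
active = {"dep1.service", "server.service1"}
print(can_activate_use("server.service2", CONNECTIONS, active))
print(can_deactivate_provide("dep1.service", CONNECTIONS, active))
```

In this situation both checks fail: server.service2 has no active provider, and dep1.service is still in use by server.service1.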

Internally, components are characterized by places representing milestones in 
the life cycle, and transitions between places, mapped to concrete reconfiguration 
actions (e.g., starting a virtual machine, downloading an image, etc.). The inter- 
nal state of a component is given by its places: at any point during execution, one 
or more places are active. While a place $\pi$ is active, transitions originating from 
it can be (simultaneously) fired, after which $\pi$ ceases to be active. Conversely, a 
place $\pi'$ becomes active after the completion of all the transitions that reach it. 
The completion of a transition takes a non-deterministic duration after firing, 
modeling the execution of the associated action. Active places also determine the 
statuses of ports: each port is bound to a set of places, and is active whenever 
one of them is active. Thus the status of ports changes according to the life cycle 
of the component. In graphic representations, ports are linked to the place (or 
set of places, denoted by rounded boxes) to which they are bound. 

The last characteristic attribute of a component is its set of behaviors. A 
behavior is a subset of the transitions in a component, such that the associated 
subgraph is acyclic. At any point in an execution, a component may execute one 
behavior. Only then can the transitions in that behavior be fired. The behaviors 
of a component serve as its operational interface: a component may have one 
behavior including the actions to start it, another including the actions to update 
it, etc. A component can be requested to execute a behavior, which will determine 
its evolution and the actions that it performs. Graphically, different behaviors 
are represented by depicting transitions in different colors. 
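
To make the vocabulary concrete, the sketch below models places, transitions, behaviors and port bindings as plain data. It is only an illustration of the concepts described above, not Concerto's actual API: the class and field names are hypothetical, and the endpoints chosen for the three transitions of dep1 are a guess consistent with the text rather than a transcription of Figure 1.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass(frozen=True)
class Transition:
    source: str    # place from which the transition originates
    target: str    # place reached when the transition completes
    behavior: str  # behavior to which the transition belongs


@dataclass
class Component:
    name: str
    places: Set[str]
    transitions: List[Transition]
    ports: Dict[str, Set[str]] = field(default_factory=dict)  # port -> bound places
    active: Set[str] = field(default_factory=set)             # currently active places

    def behaviors(self) -> Set[str]:
        """Behaviors are identified by the transitions that belong to them."""
        return {t.behavior for t in self.transitions}

    def port_active(self, port: str) -> bool:
        """A port is active whenever one of its bound places is active."""
        return bool(self.ports[port] & self.active)


dep1 = Component(
    name="dep1",
    places={"uninstalled", "installed", "running"},
    transitions=[
        Transition("installed", "running", "deploy"),         # assumed endpoints
        Transition("running", "installed", "update"),         # assumed endpoints
        Transition("installed", "uninstalled", "uninstall"),  # assumed endpoints
    ],
    ports={"service": {"running"}, "config": {"installed", "running"}},
    active={"running"},
)
```

With place running active, both service and config are active; if only installed were active, config would remain active while service would not.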

Figure 1 gives a graphic representation of an assembly. Component dep1 
includes three places (uninstalled, installed and running) and three 
transitions (arrows between places) that belong to three behaviors (deploy, update, 
and uninstall). Place running is active (denoted by a token) and bound to 
provide port service, whereas places installed and running are bound to provide 
port config. Both ports are connected to use ports belonging to server. 


[Figure 1 graphic not reproduced. Recoverable legend: behaviors deploy, update and uninstall for dep1; deploy, suspend and uninstall for server.] 


Fig. 1: A Concerto assembly with three components. For readability, the bindings 
of ports config1 and config2 are only partially depicted: they also contain 
places configured, running, s1 and s2. 


Reconfiguration Scripts. Concerto is equipped with a simple language to exe- 
cute reconfigurations. Whereas a Concerto component is written by a developer, 
the reconfiguration language is intended to be used by system administrators 
or DevOps engineers. Components are piloted through asynchronous requests 
via the command pushB(id, b) that asks the component identified by id to 
execute behavior b. The command takes its name from the fact that requests 
received by a component are queued and asynchronously executed by that com- 
ponent in the order in which they were received. While a component executes 
a behavior request, transitions in that behavior are fired until the component 
reaches a state where none of them can be fired. The behavior request is then 
considered complete, and the component executes the next one, until no more re- 
quests remain. The Concerto language also provides synchronization commands: 
wait(id) blocks the execution of the reconfiguration program until the component 
identified by id has executed all behavior requests submitted to it, and 
waitAll() blocks the execution until all components have executed all pending 
behavior requests. These three commands allow parallel asynchronous execution 
in Concerto, leading to more efficient reconfigurations. Based on the description 
of the components provided by their developers, Concerto can execute 
reconfiguration scripts, allowing for empirical performance comparisons [10]. 

The goal of this work is to generate a reconfiguration script using the three 
aforementioned commands to execute behaviors over components and bring them 
to a desired state. In addition to those three commands, the Concerto language 
also provides four usual commands to modify the topology of an assembly: create 
and delete components, connect and disconnect them. These operations are out 
of the scope of reconfiguration planning as we define it. Indeed, the decision to 
modify the topology of the assembly is usually taken by the same entity that 
determines reconfiguration goals (system administrator or autonomic analysis 
tool) [15,17] rather than left to the planning phase. Furthermore, if topological 
changes in the assembly are deemed necessary, they can almost always be imple- 
mented through a reconfiguration script with the following steps: (i) creations 
of components, (ii) creations of connections, (iii) changes in component states, 
(iv) deletions of connections and (v) deletions of components [5,7]. The main 
difficulty is to determine the operations of the third step that take the compo- 
nents to a safe state, in particular ensuring that none of the connections that 
will be deleted include an active use port. Computing a reconfiguration program 
to lead components to a desired state (or to have them perform some required 
operations) is the focus of this paper. 

As an example, consider the assembly in Figure 1, where all the components 
are running. We wish to run software updates on dep1 and dep2, but this will 
deactivate their provide port service. To carry out the updates, component 
server must first deactivate its corresponding use ports, which is accomplished 
by executing its behavior suspend. Figure 2a depicts a reconfiguration script 
that performs this, then returns the components to a running state. No explicit 
synchronization is needed between the suspension of server and the updates: the 
execution model of Concerto ensures that the updates cannot be executed as long 
as the provide ports are in use. An explicit synchronization is however needed 
before re-deploying the server, to prevent it from reactivating its use ports before 
the updates start. As a side note, the ports config (that represent configuration 
information that is not affected by the update, such as connection information) 
remain active throughout the reconfiguration: the fine-grained management of 
dependencies in Concerto avoids a full restart of the system. This assembly also 
illustrates the capacity of Concerto components to execute actions in parallel: 
for example, after server has reached place allocated, it can fire multiple 
transitions, corresponding to independent reconfiguration actions. 

Concerto provides structured semantic tools to design efficient reconfigura- 
tion plans with highly parallel, asynchronous execution. However, taking full 
advantage of these features adds complexity to the internal structure of compo- 
nents and to associated reconfiguration scripts. Automated synthesis of recon- 
figuration scripts is therefore particularly useful in this context. 


3 Reconfiguration Script Synthesis 


This section describes the synthesis process used to generate reconfiguration 
scripts. This process takes as input a description of the current state of the 
system, namely the topology of the assembly (components and their connections) 
and the active places. We assume that the system is in a state where no 
component has pending behavior requests or ongoing transitions. Besides that 
information, the synthesis process also depends on a reconfiguration goal that 
is composed of (i) constraints $\Gamma_{ports}$ on the final state of ports and (ii) a set of 
behaviors $\Gamma_{bhv}$ to execute on designated components. 

The constraints $\Gamma_{ports}$ are given by a partial function that maps specific 
instances of component ports to a boolean indicating whether that port is required 
to be active or inactive. A reconfiguration satisfies that goal if it ends in a state 
such that for any component $c$ and port $p$, if $\Gamma_{ports}(c, p)$ is defined, the port $p$ of 
component $c$ is active if and only if $\Gamma_{ports}(c, p) = \top$. Where the value of $\Gamma_{ports}$ 
is undefined, any status of the port satisfies the constraint. This means that a 
reconfiguration goal does not have to specify a unique final state for components, 
but instead allows for multiple target states. It may appear tedious to specify 
constraints for all components of an assembly when a reconfiguration is specifically 
aimed at a subset of it, but in practice the current state of the assembly 
can be used to guide the choice of $\Gamma_{ports}$ for those other components. A reasonable 
strategy might specify that provide ports active before the reconfiguration 
should remain active, and leave other ports unspecified. 
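
As a rough illustration of these definitions, a partial port goal can be represented as a dictionary that simply omits unconstrained ports. The sketch below is an assumption-laden reading of the text, not the authors' implementation: the function names and the (component, port) tuple keys are hypothetical.

```python
def satisfies_port_goal(final_status, gamma_ports):
    """Check a final state against a partial port goal.

    final_status: dict mapping (component, port) to True (active) or False.
    gamma_ports:  dict defined only on the constrained (component, port) pairs.
    Ports absent from gamma_ports are unconstrained.
    """
    return all(final_status[key] == required for key, required in gamma_ports.items())


def keep_provide_ports_active(current_status, provide_ports):
    """Default goal: provide ports that are active now must be active at the end."""
    return {key: True for key in provide_ports if current_status[key]}


# Hypothetical excerpt of the running example.
current = {("dep1", "service"): True, ("dep1", "config"): True,
           ("server", "service1"): True}
goal = keep_provide_ports_active(current, [("dep1", "service"), ("dep1", "config")])

final_ok = {("dep1", "service"): True, ("dep1", "config"): True,
            ("server", "service1"): False}
final_bad = {("dep1", "service"): False, ("dep1", "config"): True,
             ("server", "service1"): True}
print(satisfies_port_goal(final_ok, goal), satisfies_port_goal(final_bad, goal))
```

The first final state satisfies the goal even though server's use port changed status, because that port is left unspecified.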

The other element of the reconfiguration goal is the set $\Gamma_{bhv}$, where each 
element is a pair composed of a component and a behavior. The reconfiguration 
satisfies it if it executes at least all these behaviors on the corresponding components. 
The set $\Gamma_{bhv}$ alone may not correspond to a feasible reconfiguration. For 
example, a system administrator wishing to update the components of Figure 1 
might give a behavior goal $\Gamma_{bhv} = \{(dep1, update), (dep2, update)\}$ and a port 
goal $\Gamma_{ports}$ that maps every port instance to $\top$. The behaviors listed in that 
reconfiguration goal are not enough to carry it out, as it lacks a behavior to 
deactivate the use ports of the server prior to the update, and behaviors to re- 
activate all ports after the update. The synthesis process must therefore deduce 
necessary behaviors to carry out the reconfiguration goal, then schedule their 
executions in a suitable order. It proceeds as follows: 


1. for each component independently, we find a sequence of behaviors that 
satisfies the goal, assuming that port requirements are fulfilled (Subsection 3.1); 

2. we find a global schedule for these sequences of behaviors (Subsection 3.2); 

3. if the scheduling problem is found unsatisfiable, we analyze the incomplete 
schedule to deduce unsatisfied port requirements, compute additional recon- 
figuration sub-goals and iterate the process (Subsection 3.3); 

4. once a feasible solution has been found, we attempt to optimize it by relaxing 
synchronization conditions (Subsection 3.4). 


3.1 Determining Sequences of Component Behaviors 


A procedure $localSeq(c, act^0, \Gamma_{bhv}, \Gamma_{ports})$ finds a sequence of behaviors that 
satisfies a reconfiguration goal $\Gamma$ for a single component $c$ starting in a state with 
active places $act^0$. This is achieved by enumerating all sequences of behaviors 
with at most one occurrence of any behavior, and selecting one that satisfies 
the goal constraints. In practice, this enumeration is short because the number 
of behaviors of a component is usually small. More importantly, for a given 
component state (denoted by its active places), many behaviors do not have tran- 
sitions originating from the active places. Since executing these behaviors would 
not have any effect, they can be ignored during the enumeration. Consequently, 
the number of useful sequences of behaviors to analyze is often much lower than 
the number of permutations. If no satisfying sequence is found by localSeq, then 
the problem has no solution, and the whole synthesis process fails. However, if 
multiple solutions are returned, the best possible sequence is picked, according 
to some (possibly user-defined) selection criterion. Some interesting optimiza- 
tion criteria are: the length of a sequence, its execution time (if time estimations 
are available for individual transitions, this may be computed with great accu- 
racy [10]), the number of transitions it executes sequentially, or the number of 
ports it (de)activates. In our experiments, we used this last criterion, as it picks 
the component reconfiguration that is least likely to induce changes in other 
components, leading to simpler and potentially faster reconfiguration plans. 
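
The enumeration performed by localSeq can be sketched as follows, under strong simplifying assumptions that are not in the paper: the component has a single active place at a time, each behavior is given as a map from source place to target place, and the selection criterion is the sequence length rather than the number of ports (de)activated. The transition maps for the example component are hypothetical.

```python
from itertools import permutations


def run_behavior(place, transitions):
    """Follow one behavior's transitions from `place` until none applies.
    `transitions` maps a source place to a target place (acyclic by assumption)."""
    while place in transitions:
        place = transitions[place]
    return place


def local_seq(start, behaviors, gamma_bhv, goal, cost=len):
    """Enumerate behavior sequences without repetition; return the cheapest one
    that executes every behavior in gamma_bhv and ends in a place accepted by goal."""
    candidates = []
    names = list(behaviors)
    for length in range(len(names) + 1):
        for seq in permutations(names, length):
            place, useful = start, True
            for b in seq:
                if place not in behaviors[b]:  # no transition to fire: prune
                    useful = False
                    break
                place = run_behavior(place, behaviors[b])
            if useful and gamma_bhv <= set(seq) and goal(place):
                candidates.append(list(seq))
    return min(candidates, key=cost) if candidates else None


# Hypothetical transition maps for a component shaped like dep1.
DEP_BEHAVIORS = {
    "deploy": {"uninstalled": "installed", "installed": "running"},
    "update": {"running": "installed"},
    "uninstall": {"installed": "uninstalled", "running": "uninstalled"},
}
# Goal: execute update and end with the service port active (place running).
print(local_seq("running", DEP_BEHAVIORS, {"update"}, lambda p: p == "running"))
```

Sequences containing a behavior with no transition leaving the current place are discarded early, which mirrors the pruning argument given above.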

In order to coordinate sequences of behaviors across the assembly, we keep 
track of port requirements and activity during each behavior of a sequence. In 
particular, for each behavior in a sequence, we record use ports of the component 
that are activated at least once by the behavior (they must be connected to an 
active provide port during the execution of the behavior), and provide ports that 
are deactivated at least once (they must not be connected to an active use port). 
In addition, we also record the status of each port at the end of the behavior. 
This information is computed with a simple traversal of the behavior graph, 
starting from the places that are active at the beginning of the behavior. 

In the example of the update for the assembly in Figure 1, localSeq determines 
that components dep1 and dep2 should each execute the sequence 
[update, deploy]: the first behavior is included in $\Gamma_{bhv}$, and the second is 
required to take the components to a state that satisfies $\Gamma_{ports}$. 


3.2 Assembly-Level Reconfiguration Scheduling 


Once sequences of behaviors to execute over each component have been deter- 
mined, we turn our attention to the whole assembly and attempt to compute a 
sequence of reconfiguration commands (specifically, behavior requests and syn- 
chronization requests) that execute these behaviors. The challenge is to coordi- 
nate these behaviors in a way that satisfies all port requirements. To facilitate 
coordination and to restrict the search space, we specifically try to generate a re- 
configuration composed of steps, such that each component executes at most one 
behavior per step, and each step is followed by a global synchronization request. 
This assumption on parallelism is reminiscent of the BSP model [4]. Figure 2b 
gives an example of such a reconfiguration, to be compared with Figure 2a, which 
achieves the same result with fewer synchronization points. 


(a) Target reconfiguration program. 

    pushB(server, suspend) 
    pushB(dep1, update) 
    pushB(dep2, update) 
    pushB(dep1, deploy) 
    pushB(dep2, deploy) 
    wait(dep1) 
    wait(dep2) 
    pushB(server, deploy) 
    wait(server) 

(b) A reconfiguration with four synchronized steps. 

    pushB(server, suspend) 
    waitAll() 
    pushB(dep1, update) 
    pushB(dep2, update) 
    waitAll() 
    pushB(dep1, deploy) 
    pushB(dep2, deploy) 
    waitAll() 
    pushB(server, deploy) 
    waitAll() 


Fig. 2: A reconfiguration plan to perform updates on components dep1 and dep2 
of the assembly in Figure 1, then restore the system to a working state. 


SMT Constraints. To find a reconfiguration plan, ordering constraints and port 
requirements are encoded as a problem in a many-sorted first-order logic (i.e., 
the logic is equipped with sorts that partition the domain, similarly to a simple 
type system), and an SMT solver is used to obtain a solution. That encoding 
of the scheduling problem centers around a sort Behavior, with a finite num- 
ber of elements that represent the behaviors to schedule. The main task of the 
SMT solver is to find an interpretation for a function schedule that maps be- 
haviors to a reconfiguration step during which to execute them. Conceptually, 
schedule could range over natural numbers, with behavior b executed at the ith 
step if i = schedule(b). However, such a model would require constraints with 
universal quantifiers over natural numbers, which pose a challenge for SMT 
solvers. It is also unnecessary, since there are only a finite number of behaviors 
to schedule: the number of steps required is at most the number of behaviors, 
when only one component executes a behavior at each step. If behaviors are 
executed in parallel over different components, fewer steps are required. Consequently, 
to improve the performance of the solver, the different steps of the 
reconfiguration are represented by another finite-domain sort Step, with elements 
$step_1, \dots, step_n, step_{final}$. The element $step_{final}$ represents the ultimate state of the 
system rather than a reconfiguration step. Accordingly, the scheduling function 
has the signature $schedule : Behavior \to Step$, and the problem contains the 
constraint $schedule(b) \neq step_{final}$ for each behavior $b$. 


A successor function $succ : Step \to Step$ is needed to describe the effect 
of a reconfiguration step on the subsequent state of the system. Constraints 
$succ(step_i) = step_{i+1}$ (for $0 < i < n$), $succ(step_n) = step_{final}$ and 
$succ(step_{final}) = step_{final}$ define the interpretation of succ. Likewise, to easily 
express sequentiality constraints, a function $int : Step \to Int$ maps each step to 
its step number, as defined by constraints $int(step_i) = i$. With this function, 
sequentiality is easily expressed: for any two consecutive behaviors $b_1$ and $b_2$ in 
the sequence of behaviors to schedule for a given component, the constraint 
$int(schedule(b_1)) < int(schedule(b_2))$ is added. This function reintroduces an infinite domain, which 
we sought to eliminate with the sort Step. However, since the problem contains 
no quantifiers over integers, the solver only has to check that the aforementioned 
formula is satisfied by a speculated interpretation of schedule. This limited form 
of integer reasoning has a negligible impact on the search. 

The main difficulty in scheduling a reconfiguration lies in ensuring that 
port requirements are satisfied for each behavior of a component. A predicate 
$act_p : Step \to Bool$ is introduced for each (use or provide) port $p$ to indicate the 
activity status of the port at the beginning of reconfiguration steps. The status 
of each port $p$ after each behavior $b$ is uniquely defined, as determined during 
the computation of the sequences of behaviors of the component to which the 
port belongs. Correspondingly, a constraint $[\neg]act_p(succ(schedule(b)))$ is added 
to reflect that status. The square brackets denote the absence or presence of 
the negation, depending on whether the port is active or inactive at the end of 
the behavior. Conversely, the status of a port cannot change if its component is 
not executing a behavior. For a component with behaviors $b_1, \dots, b_k$, the constraint 
$schedule(b_1) \neq step_i \land \dots \land schedule(b_k) \neq step_i \implies (act_p(step_i) \iff act_p(succ(step_i)))$ 
is added for every step $i$ such that $0 < i < n$. Port requirements 
can then be modeled. Let $u$ be a use port that needs to be provided (i.e., 
connected to an active provide port) during behavior $b$, and $p$ the provide port 
to which it is connected. The constraint $act_p(schedule(b))$ ensures that $p$ is active 
(and $u$ provided) when $b$ begins. Conversely, for a provide port $p$ deactivated by 
a behavior $b$ and connected to a use port $u$, $\neg act_u(schedule(b))$ ensures that $u$ 
is inactive when $b$ begins. Furthermore, for any behavior $b$ that activates a use 
port $u$ and any behavior $b'$ that deactivates the connected provide port $p$, the 
constraint $schedule(b) \neq schedule(b')$ ensures that the behaviors are executed at 
different steps, hence separated by a synchronization barrier. 
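
The constraints above can be explored on small instances without a solver. The sketch below deliberately replaces the SMT solver with a brute-force search over step assignments; it is meant only to show what a satisfying interpretation of schedule looks like. The data layout, the behavior identifiers, and the pairing of server's use ports with the service ports of dep1 and dep2 are assumptions, and the final-state port goal is omitted for brevity.

```python
from itertools import product


def status_at(port, step, schedule, effects, initial):
    """Status of `port` at the beginning of `step`: the effect of the latest
    behavior scheduled strictly earlier that sets this port, else its initial status."""
    past = [(s, b) for b, s in schedule.items() if s < step and port in effects[b]]
    return effects[max(past)[1]][port] if past else initial[port]


def find_schedule(sequences, effects, requires, initial, conflicts=()):
    """Brute-force stand-in for the SMT query: try every step assignment."""
    behaviors = [b for seq in sequences.values() for b in seq]
    n = len(behaviors)  # at most one step per behavior is ever needed
    for steps in product(range(1, n + 1), repeat=n):
        schedule = dict(zip(behaviors, steps))
        # Sequentiality: consecutive behaviors of a component use increasing steps.
        ordered = all(schedule[a] < schedule[b]
                      for seq in sequences.values() for a, b in zip(seq, seq[1:]))
        # Conflicting behaviors must be separated by a synchronization barrier.
        separated = all(schedule[a] != schedule[b] for a, b in conflicts)
        # Port requirements must hold when each behavior begins.
        provided = ordered and all(
            status_at(p, schedule[b], schedule, effects, initial) == wanted
            for b in behaviors for p, wanted in requires[b].items())
        if ordered and separated and provided:
            return schedule
    return None


INITIAL = {"dep1.service": True, "dep2.service": True,
           "server.service1": True, "server.service2": True}
SEQUENCES = {"server": ["server.suspend", "server.deploy"],
             "dep1": ["dep1.update", "dep1.deploy"],
             "dep2": ["dep2.update", "dep2.deploy"]}
EFFECTS = {  # status of the component's own ports after each behavior
    "server.suspend": {"server.service1": False, "server.service2": False},
    "server.deploy": {"server.service1": True, "server.service2": True},
    "dep1.update": {"dep1.service": False}, "dep1.deploy": {"dep1.service": True},
    "dep2.update": {"dep2.service": False}, "dep2.deploy": {"dep2.service": True},
}
REQUIRES = {  # status required of connected ports when the behavior begins
    "server.suspend": {},
    "server.deploy": {"dep1.service": True, "dep2.service": True},
    "dep1.update": {"server.service1": False}, "dep1.deploy": {},
    "dep2.update": {"server.service2": False}, "dep2.deploy": {},
}
CONFLICTS = [("server.deploy", "dep1.update"), ("server.deploy", "dep2.update")]

plan = find_schedule(SEQUENCES, EFFECTS, REQUIRES, INITIAL, CONFLICTS)
print(plan)
```

The frame condition (a port keeps its status while its component is idle) is implicit in status_at. Exhaustive search is exponential in the number of behaviors, which is precisely the cost that delegating the search to an SMT solver is meant to avoid.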

The problem³ is passed to an SMT solver. If satisfiable, the interpretation 
found for schedule is used to build a reconfiguration script such as in Figure 2b. 
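
Turning an interpretation of schedule into a script in the style of Figure 2b is mechanical. The sketch below assumes the interpretation has already been extracted as a dictionary from (component, behavior) pairs to step numbers; the function name and this representation are illustrative choices, not the authors' code.

```python
def build_script(schedule):
    """Emit one group of pushB commands per step, each followed by waitAll().

    schedule: dict mapping (component, behavior) to its step number.
    """
    lines = []
    for step in sorted(set(schedule.values())):
        for component, behavior in sorted(k for k, s in schedule.items() if s == step):
            lines.append(f"pushB({component}, {behavior})")
        lines.append("waitAll()")
    return lines


# Hypothetical interpretation for the running example.
example = {
    ("server", "suspend"): 1,
    ("dep1", "update"): 2, ("dep2", "update"): 2,
    ("dep1", "deploy"): 3, ("dep2", "deploy"): 3,
    ("server", "deploy"): 4,
}
print("\n".join(build_script(example)))
```

Within a step the commands are sorted only to make the output deterministic; their relative order does not matter because each component receives at most one request per step.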

Note that the scheduling problem could be encoded as a SAT problem. How- 
ever, SMT solvers can reason about the theory EUF (equality and uninterpreted 
functions) using a dedicated congruence algorithm. We also use (non-recursive) 
data types, for which some SMT solvers have a dedicated reasoning algorithm [3], 
to represent the domains of Behavior and Step. These capabilities allow us to en- 
code the problem straightforwardly and obtain solutions efficiently. Also note 
that the size of the scheduling problem is only a function of the number of be- 
haviors to schedule and the number of component ports, but does not depend 
on the internal complexity of components, so that optimized components with 
several parallel transitions will not adversely affect the synthesis method. 


3.3 Determining Missing Behaviors 


Until now, we have considered the scheduling problem under the assumption of 
a fixed sequence of behaviors to schedule for each component. In general, a set 
of behaviors may have no feasible schedule. For example, it is not possible to 
fully execute the behavior update on components dep1 and dep2 of the assembly 
in Figure 1 without first deactivating the use ports service1 and service2 of 
component server, i.e., executing its behavior suspend. To plan reconfigurations 
for an incomplete set of behaviors, we use our SMT encoding of the scheduling 
problem to detect the point in the reconfiguration at which additional changes 
must be performed, then we create new component reconfiguration sub-problems 
and use the solutions to augment the sequences of behaviors to schedule. 

3 Illustrating instances for the running example, in the SMT-LIB file format, can be 
found at https://doi.org/10.5281/zenodo.5820571. 


Let S be a mapping that associates to each component a sequence of behaviors
(i.e., the sequence to be executed by that component, as determined in
Subsection 3.1). A maximal executable schedule S’ of S is a mapping that
associates to each component c a prefix of S(c), such that (i) the scheduling problem
corresponding to the sequences in S’ has a solution, and (ii) no reconfiguration
problem built by extending a prefix in S’ with one behavior has a solution. Intuitively,
a maximal executable schedule is a point up to which the reconfiguration S' can
be carried out, before unsatisfied port requirements prevent further execution.


Procedure 1 iteratively computes a maximal executable schedule S’ and uses
the resulting information to refine the sequences of behaviors to execute for
each component, until a solution is found that executes them all. By analyzing
the statuses of ports in the assembly at the end of the execution of S’ (which
depend only on the last behavior in each sequence), and comparing them to
the requirements of the first unscheduled behaviors in S, we deduce a set of
provide ports to activate and use ports to deactivate to allow further scheduling
of S, and compute intermediary port constraints Γ’_ports. For each component
c that does not have unscheduled behaviors in S, we determine a sequence s1
of behaviors that satisfies this intermediate goal (assuming that the component
starts with active places act_c^1, corresponding to its state after executing the last
behavior in S’(c)) and a sequence s2 that takes the component from its state
after executing s1 (active places act_c^2) to one that satisfies the port constraints
Γ_ports of the original goal. Sequences of behaviors to execute are thus extended
([] denotes the empty sequence, and s1 · s2 the concatenation of two sequences).
To ensure a monotonic search, sequences are extended only for components c
without unscheduled behaviors in S, i.e., not the components that brought about
the intermediary goal Γ’_ports. If no such extension can be found (¬progress), the
scheduling of S is blocked by a circular dependency between components and the
synthesis process fails. If the procedure terminates, it returns a reconfiguration
script corresponding to a solution of the scheduling problem of S.


Consider the example of running updates in the assembly of Figure 1. Initially
(see Subsection 3.1), the mapping S of sequences of behaviors computed with
localSeq is defined by S(dep1) = S(dep2) = [update, deploy], and S(server) =
[], because Γ_bhv does not include any behavior for that component, and the
component is already in a state that satisfies Γ_ports. This combination of sequences of
behaviors has no feasible schedule. In particular, the mapping S’ that associates
to every component the empty sequence is found to be a maximal executable
schedule of S. The first unscheduled behaviors in S are two instances of update;
they require use ports service1 and service2 of component server to be
deactivated. Consequently, two new reconfiguration sub-goals are created for server.
The first requires it to reach a state where the two ports are deactivated; a call
to localSeq returns the solution s1 = [suspend]. From the resulting component
state, the second reconfiguration sub-goal requires server to go to a state that
satisfies Γ_ports; in this case localSeq returns the sequence s2 = [deploy]. S is
updated so that S(server) = [suspend, deploy]. At this point, S is found to
be a maximal executable schedule of itself, and the corresponding solution is
returned, i.e., the reconfiguration plan in Figure 2b. Note that Procedure 1 is not
guaranteed to terminate, nor is it a complete search algorithm. In particular, it
relies on two heuristics: the selection function used when localSeq finds multiple
candidate sequences, and the choice of maximal executable schedule for a given
mapping S.


Procedure globalSolution(A, Γ_bhv, Γ_ports) is
    for c ∈ A do S(c) ← localSeq(c, act_c^0, Γ_bhv, Γ_ports);
    while findMaxExecSchedule(S) ≠ S do
        S’ ← findMaxExecSchedule(S);
        Γ’_ports ← port conditions required to execute, for every component c,
                   the first behavior in S(c) that is not in S’(c);
        progress ← false;
        for c ∈ A such that S’(c) = S(c) do
            s1 ← localSeq(c, act_c^1, Γ_bhv \ S’(c), Γ’_ports);
            s2 ← localSeq(c, act_c^2, ∅, Γ_ports);
            if s1 ≠ [] or s2 ≠ [] then
                S(c) ← S’(c) · s1 · s2;
                progress ← true;
            end
        end
        if ¬progress then fail;
    end
    return reconfigurationScriptOfSolution(S);
end


Procedure 1: Synthesizes a reconfiguration script.


Computing a Maximal Executable Schedule. Procedure 1 relies on a function
findMaxExecSchedule, illustrated in Procedure 2, to compute a maximal executable
schedule of a mapping S. This function maintains a mapping containing prefixes of
elements in S (initially mapping every component to the empty sequence) and
incrementally extends those prefixes, calling the SMT solver each time to check
the satisfiability of the corresponding scheduling problem. In the actual implementation,
some simple checks are also used to quickly detect some trivially unsatisfiable
or satisfiable instances of the scheduling problem, although these are left out of
Procedure 2 for clarity. The procedure continues until all behaviors have been
included or no additional behavior can be scheduled. A maximal executable
schedule always exists (the mapping that associates every component to the
empty sequence always has a satisfiable scheduling problem, and may be
maximal), and findMaxExecSchedule always finds one. However, maximal executable
schedules are not unique, and a bad choice may result in an ineffective
reconfiguration plan. In the example above, during the second iteration, the mapping
S of sequences to schedule is defined by S(server) = [suspend, deploy] and
S(dep1) = S(dep2) = [update, deploy]. S itself is a maximal executable
schedule of S, but so is the mapping S’ defined by S’(server) = [suspend, deploy]
and S’(dep1) = S’(dep2) = []. S’ corresponds to the case where the server
is restarted too early. Picking this maximal executable schedule will ultimately
lead to a reconfiguration that stops the server at least twice. To avoid this, a
good heuristic for findMaxExecSchedule is to extend in priority the prefixes for
which the added behavior is least likely to affect other components, i.e., those
that deactivate the fewest provide ports and activate the fewest use ports.


Procedure findMaxExecSchedule(S) is
    suffixes ← S;
    for c such that suffixes(c) is defined do prefixes(c) ← [];
    progress ← true;
    while progress do
        progress ← false;
        for c such that suffixes(c) ≠ [] do
            b ← head(suffixes(c));
            if the scheduling problem for prefixes extended with b is satisfiable
            then
                progress ← true;
                prefixes(c) ← prefixes(c) · [b];
                suffixes(c) ← tail(suffixes(c));
            end
        end
    end
    return prefixes;
end


Procedure 2: Computes a maximal executable schedule for sequences
of behaviors S.
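

The two procedures can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the satisfiability check performed by the SMT solver is replaced by a callable is_schedulable, the two calls to localSeq are replaced by a callable propose_extension, and both toy_oracle and toy_extension are hypothetical stand-ins that hard-code the dependencies of the running example (server, dep1, dep2). A round limit is added because Procedure 1 is not guaranteed to terminate.

```python
from typing import Callable, Dict, List

Schedule = Dict[str, List[str]]  # component name -> sequence of behaviors


def find_max_exec_schedule(S: Schedule,
                           is_schedulable: Callable[[Schedule], bool]) -> Schedule:
    """Procedure 2: greedily grow per-component prefixes of S."""
    suffixes = {c: list(seq) for c, seq in S.items()}
    prefixes: Schedule = {c: [] for c in S}
    progress = True
    while progress:
        progress = False
        for c in suffixes:
            if not suffixes[c]:
                continue
            candidate = dict(prefixes)
            candidate[c] = prefixes[c] + [suffixes[c][0]]
            if is_schedulable(candidate):  # stands in for the SMT call
                prefixes = candidate
                suffixes[c] = suffixes[c][1:]
                progress = True
    return prefixes


def global_solution(S: Schedule, is_schedulable, propose_extension,
                    max_rounds: int = 50) -> Schedule:
    """Outer loop of Procedure 1 (round limit added as a safety guard)."""
    S = {c: list(seq) for c, seq in S.items()}
    for _ in range(max_rounds):
        S_prime = find_max_exec_schedule(S, is_schedulable)
        if S_prime == S:
            return S
        progress = False
        for c in S:
            if S_prime[c] == S[c]:  # c has no unscheduled behavior
                extension = propose_extension(c, S, S_prime)  # role of s1 · s2
                if extension:
                    S[c] = S_prime[c] + extension
                    progress = True
        if not progress:
            raise RuntimeError("no extension found: circular dependency")
    raise RuntimeError("round limit reached")


def toy_oracle(prefixes: Schedule) -> bool:
    """Hypothetical oracle encoding the running example's dependencies."""
    server = prefixes.get("server", [])
    for dep in ("dep1", "dep2"):
        seq = prefixes.get(dep, [])
        if "update" in seq and "suspend" not in server:
            return False  # update needs the server's use ports deactivated
        if server == ["suspend", "deploy"] and seq == ["update"]:
            return False  # the server cannot restart while a dependency is down
    return True


def toy_extension(c: str, S: Schedule, S_prime: Schedule) -> List[str]:
    """Hypothetical stand-in for the two localSeq calls of Procedure 1."""
    return ["suspend", "deploy"] if c == "server" and not S[c] else []


initial = {"server": [], "dep1": ["update", "deploy"], "dep2": ["update", "deploy"]}
first = find_max_exec_schedule(initial, toy_oracle)
final = global_solution(initial, toy_oracle, toy_extension)
print(first)
print(final)
```

Under these assumptions, the first call returns the mapping of every component to the empty sequence, and the outer loop then extends the sequence of server before finding a schedule that covers all behaviors, mirroring the walk-through of the running example.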


3.4 Relaxation of Synchronization Barriers 


The assumption that reconfigurations should proceed in globally synchronized
steps, although useful to find a solution, severely limits the potential for
inter-component parallelism, a key feature of Concerto. A final optimization stage
takes the reconfiguration plan with synchronized steps and relaxes synchronization
where possible. First, every command waitAll() is replaced with a sequence
of commands wait(c) for every component c that executes a behavior in the
preceding step. This preserves the semantics of the reconfiguration and makes the
targets of synchronization explicit. Then, for a given step i and a given command
wait(c) after this step, we apply the following rule: if for all behaviors executed
by c since the last command wait(c) up to step i, no provide (resp., use) port is
deactivated (resp., activated) and connected to a use (resp., provide) port that
is activated (resp., deactivated) at step i + 1, then wait(c) can be delayed until
after step i + 1. This rule is applied for every step in order, delaying barriers
as late as possible and removing duplicates. This transformation reduces the
number of barriers yet ensures that behaviors with conflicting effects on ports
remain separated by an explicit synchronization. Port requirements for behaviors
do not have to be taken into account, as the Concerto execution model ensures
implicit synchronization for those. As an example, this optimization applied to
the reconfiguration plan in Figure 2b yields the one in Figure 2a.
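

The relaxation rule can be sketched as follows. This is an illustrative reading of the rule, not the authors' code: a plan is assumed to be a list of steps mapping components to behaviors, the final barrier is assumed to be kept at the end of the script, and the port analysis is abstracted into a hypothetical set conflicts of pairs ((c, b), (c2, b2)) stating that behavior b of component c must be completed before b2 of c2 starts. The conflict pairs used in the example are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

Step = Dict[str, str]  # component -> behavior executed at this step
Action = Tuple[str, str]  # (component, behavior)


def relax_barriers(steps: List[Step],
                   conflicts: Set[Tuple[Action, Action]]) -> List[tuple]:
    """Emit push/wait commands, delaying each wait(c) as late as possible."""
    script: List[tuple] = []
    pending: Dict[str, List[str]] = {}  # behaviors of c since its last wait(c)
    for i, step in enumerate(steps):
        for c, b in step.items():
            script.append(("push", c, b))
            pending.setdefault(c, []).append(b)
        nxt = steps[i + 1] if i + 1 < len(steps) else None
        for c in list(pending):
            # wait(c) is needed here only if a pending behavior of c conflicts
            # with a behavior of the next step (or if the plan is finished).
            must_wait = nxt is None or any(
                ((c, b), (c2, b2)) in conflicts
                for b in pending[c]
                for c2, b2 in nxt.items()
            )
            if must_wait:
                script.append(("wait", c))
                del pending[c]
    return script


steps = [
    {"server": "suspend"},
    {"dep1": "update", "dep2": "update"},
    {"dep1": "deploy", "dep2": "deploy"},
    {"server": "deploy"},
]
# Hypothetical conflicts: update deactivates provide ports that the
# server's deploy behavior needs to use again.
conflicts = {
    (("dep1", "update"), ("server", "deploy")),
    (("dep2", "update"), ("server", "deploy")),
}
script = relax_barriers(steps, conflicts)
waits = [cmd for cmd in script if cmd[0] == "wait"]
print(script)
```

With these inputs, the four global barriers of the synchronized plan are reduced to three component-level waits, two of which sit just before the server is redeployed.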


4 Experiments 


The implementation described here, the examples, and the experimental results 
are available at https://doi.org/10.5281/zenodo.5820571. 


4.1 Implementation 


We implemented the synthesis process in a Python tool that attempts to pro- 
duce a reconfiguration script for a given assembly and reconfiguration goal. The 
process is entirely automated. Given a description of an assembly and a recon- 
figuration goal, it generates relevant scheduling problems and interacts with an 
SMT solver to generate reconfiguration programs. Intermediate scheduling prob- 
lems can be output in the SMT-LIB file format, the standard used by most SMT 
solvers [2], and can be solved using any solver that complies with version 2.6 of 
the SMT-LIB standard. The preferred mode of operation for our tool does not 
output files, but interacts with the SMT solver Z3 [23] through the Z3 Python 
API. This interface makes it easy to analyze interpretations returned by the 
solver for satisfiable problems, and thus to reconstruct schedules. This is the 
mode of operation used to conduct the experiments described below. 


4.2 Results Over Synthetic Examples 


To test our technique on a variety of cases, we devised assemblies with four types 
of topology. In central-user assemblies, a set of provider components, each with 
a pair of provide ports, is connected to different use ports of one central user 
component. In central-provider assemblies, one central provider component has 
a pair of provide ports that is connected to (a pair of use ports of) multiple 
other components. In linear assemblies, components form a chain such that 
each component has a pair of provide ports connected to the pair of use ports 
of the next component. In stratified architectures, components are organized in
levels containing up to three components, such that each component in a level
has a pair of provide ports connected to use ports on every component in the 
level above (i.e., a provide port can be connected to up to three use ports). 
Every component in these assemblies is equipped with behaviors to deploy it, 
update or suspend it, and uninstall it. Figure 3 depicts those four topologies, with 
internal nets of components omitted for clarity. As an example of the internal 
structure of components, Figure 1 shows the central-user assembly with three 
components. For other types and sizes of assembly, components follow similar 
internal structures, adapted to offer adequately many ports. 
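
The connection patterns of the four topologies can be sketched as lists of (provider, user) pairs. This is only an illustration under stated assumptions: component names are hypothetical, each pair stands for the connection of a pair of provide ports to a pair of use ports, and the stratified generator assumes exactly three components per level, whereas the text allows up to three.

```python
from typing import List, Tuple

Edge = Tuple[str, str]  # (provider component, user component)


def central_user(n: int) -> List[Edge]:
    """n - 1 providers, all connected to one central user."""
    return [(f"provider{i}", "user") for i in range(1, n)]


def central_provider(n: int) -> List[Edge]:
    """One central provider connected to n - 1 users."""
    return [("provider", f"user{i}") for i in range(1, n)]


def linear(n: int) -> List[Edge]:
    """A chain: each component provides to the next one."""
    return [(f"c{i}", f"c{i + 1}") for i in range(1, n)]


def stratified(levels: int, width: int = 3) -> List[Edge]:
    """Each component provides to every component of the level above."""
    edges = []
    for lvl in range(levels - 1):
        for p in range(width):
            for u in range(width):
                edges.append((f"l{lvl}c{p}", f"l{lvl + 1}c{u}"))
    return edges


print(len(central_user(10)), len(central_provider(10)),
      len(linear(10)), len(stratified(4)))
```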

For each architecture, we generated assemblies with 10, 30 and 100 components
(scaling the number of providers for central-user, the number of users
for central-provider, the length of the chain for linear or the number of levels
for stratified), and ran three scenarios. The deployment scenario starts with all
components uninstalled; Γ_ports requires the activation of a provide port on the
last component(s) in the dependency order, while Γ_bhv is empty. The update
scenario starts with all components running; Γ_ports requires a similar final state,
and Γ_bhv includes update behaviors for components that are first in the
dependency order and no behavior for the others. The uninstall scenario starts with
all components running; Γ_ports requires the deactivation of all ports, while Γ_bhv
is empty. Each scenario affects every component of the assembly.

Fig. 3: The four assembly topologies in synthetic examples.


Table 1 describes the solving process and resulting solution for these 36 examples.
Experiments were executed on a computer with an 8-core 1.6 GHz processor
and 16 GiB of RAM. Solutions were successfully generated for all but 4 examples
(the process was aborted after one hour). For 21 of them, the process took
less than a minute. Results indicate that the solving time, and ultimately the
success of the method, depend on the topology of the assembly: the assemblies
for which some reconfigurations could not be computed within one hour are
those with long chains of dependencies (linear and stratified assemblies with 100
components). This can be explained in two ways. Firstly, the propagation of port
requirements and the deduction of missing behaviors require a number of
iterations of the main loop of Procedure 1 proportional to the length of the longest
chain of dependency. Secondly, architectures with long chains of dependencies
are less conducive to parallel execution of behaviors, and therefore the instances
of scheduling problems solved have a high number of steps, leading to a large
search space and long solving times. For example, the deployment of 100
components in the linear architecture ultimately requires 100 steps. For each of the 17
instances of the scheduling problem solved to compute that reconfiguration, the
SMT solver took on average 147 seconds to return a solution. In contrast, to
deploy 100 components in the central-user architecture, the reconfiguration script
requires only 2 steps; as a result, the SMT solver was able to return a solution
after only 0.21 seconds on average each time it was called. For difficult problems,
the solving time is dominated by calls to the SMT solver, as shown in the solving
time column of Table 1 (in parentheses, the cumulated time taken by the SMT
solver). Overall, these examples show that our method is able to plan
reconfigurations affecting a large number of components. Furthermore, architectures with
a very large number of components, such as microservice architectures, typically
have a shallow depth rather than long chains of dependencies, and scale
horizontally [21,24,25], similarly to our central-user and central-provider architectures.
Our method scales well in those conditions.
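
The averages quoted above can be checked against the corresponding rows of Table 1 (deployment of 100 components). The short computation below is only a sanity check; the dictionary layout and key names are ours, while the figures are those of the table.

```python
# Figures taken from Table 1, deployment scenario, 100 components.
rows = {
    "linear": {"calls": 17, "smt_time": 2512.22, "total_time": 2689.86},
    "central-user": {"calls": 17, "smt_time": 3.59, "total_time": 23.94},
}


def summarize(row: dict) -> tuple:
    """Return (average time per SMT call, share of total time spent in the solver)."""
    avg = row["smt_time"] / row["calls"]
    share = row["smt_time"] / row["total_time"]
    return avg, share


avg_linear, share_linear = summarize(rows["linear"])
avg_cuser, share_cuser = summarize(rows["central-user"])
for name, row in rows.items():
    avg, share = summarize(row)
    print(f"{name}: {avg:.2f} s per SMT call, {share:.0%} of the time in the solver")
```

The linear case gives slightly under 148 seconds per call, in line with the reported 147 seconds, and the solver accounts for more than 90% of the total time. In the central-user case the average is about 0.21 seconds per call and the solver accounts for roughly 15% of the total.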

Writing a correctly coordinated reconfiguration plan with tens of asynchronous
behaviors is a non-trivial task. It is particularly difficult when explicit
synchronization commands are needed in the reconfiguration script. The execution
model of Concerto ensures that this is seldom necessary, but some synchroniza- 
tion barriers are required, e.g., in the update scenarios to prevent early restarts 
that would block the updates. Our synthesis technique determines synchroniza- 
tion points required for completion of the reconfigurations, but it also avoids 
synchronization points that would slow the execution unnecessarily. It performs 
these tasks quickly, with a time gain that is especially significant when compared 
to the service interruption that an incorrect reconfiguration would cause. 


4.3 OpenStack Use Case 


We also tested our method on a real OpenStack system. OpenStack is the de facto
standard open-source solution to address the IaaS level of the cloud paradigm;
it can be seen as the open-source operating system of the cloud.

In previous work [8], Madeus, a subset of Concerto restricted to deployment, 
was used to deploy an OpenStack system. Following the deployment strategy of 
the reference production deployment tool Kolla, 11 components were specified, 
resulting in a real OpenStack deployment up to 70% faster than Kolla. Here we 
use the same components, extended with behaviors for reconfiguration. We ana- 
lyzed the official installing, updating and uninstalling procedures of OpenStack 
to design the associated internal nets. Figure 4 depicts those components and 
their connections, with details of the internal structure depicted for the four main 
components, and omitted for clarity on seven others. The reconfiguration starts 
with all components running. The reconfiguration scenario requires an update of 
the database component (Γ_bhv = {(mariadb, update), (mariadb, deploy)}) and
Γ_ports specifies that all ports must eventually return to their initial (active) state.
Our method generates a reconfiguration plan in 1.95 seconds, correctly deducing

           assembly          solving                         plan
scenario   arch.       size  smt        time (s)             steps  bhvs
deploy     c-user        10  2 (2)      0.25 (0.02)              2    10
                         30  5 (5)      1.99 (0.18)              2    30
                        100  17 (17)    23.94 (3.59)             2   100
           c-provider    10  2 (2)      0.33 (0.08)              2    10
                         30  5 (5)      3.95 (1.80)              2    30
                        100  17 (17)    93.07 (68.67)            2   100
           linear        10  2 (2)      0.40 (0.08)             10    10
                         30  5 (5)      9.19 (4.27)             30    30
                        100  17 (17)    2689.86 (2512.22)      100   100
           stratified    10  3 (3)      0.64 (0.09)              5    10
                         30  6 (6)      12.04 (5.28)            12    30
                        100  18 (18)    1274.52 (1121.40)       35   100
update     c-user        10  6 (5)      1.69 (0.78)              ?    20
                         30  12 (11)    25.07 (18.59)            4    60
                        100  36 (35)    1737.29 (1654.67)        4   200
           c-provider    10  13 (4)     1.79 (0.41)              4    20
                         30  40 (11)    22.98 (9.97)             4    60
                        100  133 (34)   685.54 (541.69)          4   200
           linear        10  50 (5)     12.72 (3.26)            20    20
                         30  446 (11)   1388.50 (825.06)        60    60
                        100  -          -                        -     -
           stratified    10  20 (6)     14.55 (5.82)            13    26
                         30  147 (16)   2306.85 (1885.17)       34    86
                        100  -          -                        -     -
uninstall  c-user        10  3 (3)      0.54 (0.14)              3    11
                         30  6 (6)      6.01 (2.76)              3    31
                        100  18 (18)    162.51 (125.51)          3   101
           c-provider    10  ? (4)      0.91 (0.30)              3    19
                         30  10 (10)    12.36 (7.25)             3    59
                        100  34 (34)    571.73 (514.48)          3   199
           linear        10  10 (10)    1.30 (0.19)             10    10
                         30  30 (30)    39.31 (13.69)           30    30
                        100  -          -                        -     -
           stratified    10  ? (4)      2.50 (1.10)              9    19
                         30  11 (11)    132.59 (108.53)         23    59
                        100  -          -                        -     -

Table 1: Results of the synthesis process on synthetic examples. For each
problem, the table indicates the architecture of the assembly (arch.) and its number
of components (size), the number of problems solved by the SMT solver (smt)
followed in parentheses by the number of those that were found satisfiable, the
total solving time in seconds followed in parentheses by the cumulated time taken
by the SMT solver (time), the number of steps in the solution before relaxation
of the synchronization barriers (steps), and the number of behaviors executed in
that solution (bhvs).


284 S. Robillard, H. Coullon 


Fig. 4: A Concerto assembly for an OpenStack system.


missing behaviors for mariadb as well as components affected by the interruption 
of its service, i.e., keystone, nova, neutron, and glance. The generated plan 
coordinates 12 behaviors on these 5 components. After optimization, it includes 
only 2 synchronization points needed to ensure the complete re-deployment of 
mariadb and keystone, whose services are required by other components. 

While the scale of this use case may seem limited, its architecture is not 
trivial. This real-world scenario leads to a complicated synchronization problem. 
The 12 behaviors in the reconfiguration program require 8 global synchronization 
steps before optimization. The optimization phase reduces this to 2 individual 
synchronization points, thus enhancing the level of parallelism and asynchrony of 
the reconfiguration program, while preserving its correctness. A DevOps engineer 
or system administrator would be challenged to write such a program without 
errors or unnecessary synchronization points, whereas our solution only requires 
them to specify a reconfiguration goal. 


5 Related work 


For models with fixed component life cycles, planning and scheduling techniques 
have been used to plan reconfigurations [1,13]. Pre-established protocols can 
also be used: while such solutions are in general less flexible, they have desirable 
features such as decentralized coordination [14] or recovery policies [6]. Com- 
paratively fewer works study the problem of reconfiguration planning in models 
with programmable component life cycles, such as Concerto. Kikuchi et al. [20] 
synthesize reconfiguration plans with a model finder. Unlike us, they assume that 
all available reconfiguration operations are given in the input of the scheduling 
problem, which may limit scalability. Operations and reconfiguration goal are
encoded in the Alloy specification language, and synthesis is performed by the
Alloy Analyzer. This work relies on a simple ad hoc component-based model, 
with reconfiguration operations that must be sequentially ordered. The model 
does not have specific execution semantics, instead the list of operations has to 
be given by the user, with their effects described as constraints on the states 
before and after the operations. Therefore the correctness of the correspondence 
between the synthesized procedure and its executable counterpart depends on 
the user. Metis [22] closes that gap between planning and execution, as it sched- 
ules deployment plans for distributed systems in the Aeolus model [12], which 
has formal execution semantics. The authors first describe the problem as a 
generic planning problem and use standard planners to solve it, then present a 
specialized solving algorithm. Metis is limited to deployment rather than general
reconfiguration, making the computation of dependencies more straightforward.
Aeolus shares many similarities with Concerto, but lacks intra-component
parallelism and asynchronous commands in its reconfiguration language. These 
features improve the efficiency of reconfigurations but also make them more dif- 
ficult to plan. Note that these features can also be represented through planning 
and scheduling problems [18], typically solved by approximation. 

The problem of determining reconfiguration goals (i.e., the analysis phase) is 
complementary to the planning problem. Engage [15] uses a SAT solver to build 
a complete target configuration (a set of components to deploy) from a partial 
specification, based on a hierarchical specification of a distributed software stack. 
It also performs limited planning, namely sequentially ordering deployments. 
Engage does not account for the state of the system, and is thus limited to 
initial deployments or reconfigurations from the ground up. Zephyrus [11] and 
ConfSolve [17] are two tools to infer, from the state of the system/environment, 
a target configuration that could be used as an entry of our planning tool. 


6 Conclusion 


We have described a synthesis method for reconfiguration plans of component- 
based systems, that relies on (i) finding local solutions at the component level, 
(ii) finding a schedule that coordinates those solutions at the assembly level, with 
the help of an SMT solver, (iii) determining unsatisfied dependencies to refine 
the reconfiguration goal until it becomes satisfiable, and (iv) optimizing the syn- 
thesized reconfiguration plan to improve its level of parallel and asynchronous 
execution. Dividing the problem in this manner, as opposed to attempting to 
solve it at once with an SMT solver, is a key to solving large instances, although 
it leads to incompleteness (the third step relies on an incomplete search guided 
by some heuristic choices). This design decision does not appear to affect the suc- 
cess of the method or the quality of synthesized plans, and allows the technique 
to scale to applications with a large number of components, as demonstrated in
our experiments on synthetic examples and a real use case. To improve scalabil- 
ity on complex architectures, this technique could be adapted to a hierarchical 
composition model, which would lend itself to a recursive resolution algorithm.

Abstract. Semantic clone detection is the process of finding program
elements with similar or equal runtime behavior. An example is detecting
the semantic equality between the recursive and iterative implementations
of the factorial computation. Semantic clone detection is the de facto
technical boundary of clone detectors. In recent years, this boundary has 
been tested using interesting new approaches. This article contributes 
a semantic clone detection approach that detects clones which have 0% 
syntactic similarity. We present Semantic Clone Detection via Probabilis- 
tic Software Modeling (SCD-PSM) as a stable and precise solution to 
semantic clone detection. PSM builds a probabilistic model of a program 
that is capable of evaluating and generating runtime data. SCD-PSM 
leverages this model and its model elements for finding behaviorally equal 
model elements. This behavioral equality is then generalized to semantic 
equality of the original program elements. It uses the likelihood between 
model elements as a distance metric. Then, it employs the likelihood 
ratio significance test to decide whether this distance is significant, given 
a pre-specified and controllable false-positive rate. The output of SCD- 
PSM are pairs of program elements (i.e., methods), their distance, and a 
decision on whether they are clones or not. SCD-PSM yields excellent 
results with a Matthews Correlation Coefficient greater than 0.9. These 
results are obtained on classical semantic clone detection problems such 
as detecting recursive and iterative versions of an algorithm, but also on 
complex problems used in coding competitions. 


Keywords: semantic clone detection · probabilistic software modeling
· clone detection


1 Introduction 


Copying and pasting source code fragments leads to code clones, which are
considered an anti-pattern. Code clones increase maintenance costs [31,32], promote
bad software design [29,13,17], and introduce or propagate bugs [4,28,14].
However, duplicating code fragments also allows faster adaptation to requirements,
the re-use of stable and well-tested solutions [25,26], and helps to overcome 
language limitations [21,35], thereby lowering development costs. The impact 
of code clones and the contradicting evidence various studies provide are the 
topics of an ongoing discussion in the community. Meanwhile, it is certain that 
developers will continue duplicating source code to leverage its benefits, despite 
its drawbacks. The key is the awareness and management of clones to maximize 
efficiency while balancing quality. 


Traditionally, the clone taxonomy distinguishes between four types of clones 
[35,2,34]. Types 1-3 describe code clones caused by copying and pasting the
source code with or without changes. Type 4 clones describe code clones that 
do not have any syntactic similarity but implement the same functionality 
(semantic equivalence). For example, the recursive and iterative implementation 
of an algorithm (e.g., Fibonacci computation) have no syntactic similarity while 
implementing the same functionality. Existing tools have limited or no capabilities 
to detect Type 4 clones [19]. Most current studies exclude them because of the 
lack of tool support [23,35,2,39,11]. Nevertheless, Type 4 clones exist, and recent 
research efforts have tried to deepen the understanding of them [19,49,20]. This 
article provides a significant contribution to semantic clone detection in the form 
of novel concepts and a prototype implementing them. 


We present Semantic Clone Detection via Probabilistic Software Modeling 
(SCD-PSM). SCD-PSM extends our work on Probabilistic Software Modeling 
(PSM) [43] via a semantic clone detection pipeline. PSM builds Probabilistic 
Models (PMs) from programs. It analyzes the static structure and dynamic run- 
time behavior and replicates the program in the form of a generative probabilistic 
model. These models allow developers to reason about the semantics of a program. 
SCD-PSM extends this work by leveraging the PMs and causal reasoning to 
find semantically (i.e., behaviorally) equivalent code elements. SCD-PSM allows 
full quantification of the behavioral distance of code elements via likelihoods. 
Furthermore, the likelihood evaluation via PMs allows for statistical significance 
tests to decide whether a pair of code elements are clones. SCD-PSM detects 
semantic clones with no textual similarity, such as the iterative and recursive 
version of an algorithm. The average performance of the approach reaches a 
Matthews Correlation Coefficient of 0.965 on a complex problem set indicating 
a robust method for semantic clone detection. This work extends our previous 
work [41] with a full evaluation and the theoretical foundation. 


Section 2 provides the background needed to understand SCD-PSM including 
the basics of PSM. Section 3 clarifies what semantic clones are in the context of 
this work. Section 4 presents the approach in which representation, search space, 
and the various similarity stages are described. Section 5 evaluates the approach 
while Section 6 discusses the results. Limitations of the approach and possible 
threats are given in Section 7 and Section 8. Section 9 compares the work to the 
state of the art and Section 10 concludes this article.


int fa(int n) {
    int product = 1;
    for (int i = 1; i <= n; i++)
        product *= i;
    return product;
}


Listing 1.1: for-loop implementation of factorial


int fb(int n) {
    int product = 1;
    int i = 1;
    while (i <= n) {
        product *= i;
        i++;
    }
    return product;
}


Listing 1.2: while-loop implementation of factorial


int fc(int n) {
    if (n <= 1) return 1;
    return fc(n - 1) * n;
}


Listing 1.3: Recursive implementation of factorial


int fd(int n, String guard) {
    if (n < 1 && "val".equals(guard))
        return -1;
    if (n < 1 && "throw".equals(guard))
        throw new RuntimeException();
    return fc(n);
}


Listing 1.4: Delegate implementation of factorial


2 Background 


The clone detection research community has a long history and defines many 
concepts, algorithms, and tools. In contrast, Probabilistic Software Modeling 
(PSM) is relatively new and combines software engineering and probabilistic 
modeling. Some terms need clarification; others require an introduction if they 
diverge from their traditional names. 


2.1 Clone Detection 


Clone detection is the process of finding two similar program fragments. List- 
ings 1.1 to 1.4 are four different implementations of the factorial function (n!). 
Listing 1.1 is a for-loop implementation, Listing 1.2 uses a while-loop, and List- 
ing 1.3 is recursively defined. Finally, Listing 1.4 delegates its implementation to 
fc() from Listing 1.3 but may also return −1 in case of invalid inputs (including
n = 0).
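
The behavioral relation between the four listings can be checked with a direct transcription. The Python versions below are our own rendering of Listings 1.1 to 1.4, given only to make the comparison executable; the choice of ValueError for the throwing guard is an assumption.

```python
def fa(n: int) -> int:
    """for-loop factorial (Listing 1.1)."""
    product = 1
    for i in range(1, n + 1):
        product *= i
    return product


def fb(n: int) -> int:
    """while-loop factorial (Listing 1.2)."""
    product, i = 1, 1
    while i <= n:
        product *= i
        i += 1
    return product


def fc(n: int) -> int:
    """Recursive factorial (Listing 1.3)."""
    return 1 if n <= 1 else fc(n - 1) * n


def fd(n: int, guard: str) -> int:
    """Delegate to fc, with special handling of inputs below 1 (Listing 1.4)."""
    if n < 1 and guard == "val":
        return -1
    if n < 1 and guard == "throw":
        raise ValueError("invalid input")
    return fc(n)


inputs = range(0, 8)
agree = all(fa(n) == fb(n) == fc(n) for n in inputs)
differs_at_zero = fd(0, "val") != fc(0)
print(agree, differs_at_zero)
```

The first three functions agree on every tested input despite sharing little text, while fd matches them only for n ≥ 1: with the guard "val" it returns −1 for n = 0, where the others return 1.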

Representation, pairing, similarity evaluation, and clone decision are the 
core concepts of clone detection. Representations describe on which artifact 
the detector operates, such as text, graphs (e.g., AST), or probabilistic models. 
Pairing describes the selection of two code fragments that are potentially clones 
(e.g., fa() and fb()). Each pair is called a candidate clone pair (or candidate 
pair). The similarity evaluation measures the similarity of a candidate pair, e.g., 
by counting the number of different characters. Finally, the clone decision labels 
the candidate pair as a clone given a criterion on the similarity, e.g., less than 
ten different characters. 
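
The four concepts can be sketched for a purely textual detector. The following
Python sketch is illustrative only: the fragment strings, the variant fa_renamed,
and the helper names are assumptions made for this example, and the threshold of
ten differing characters is the example criterion mentioned above.

```python
from itertools import combinations, zip_longest

# Representation: every fragment is represented by its raw text.
TEMPLATE = "product = 1 for({v} = 1; {v} <= n; {v}++) product *= {v} return product"
FRAGMENTS = {
    "fa": TEMPLATE.format(v="i"),
    "fa_renamed": TEMPLATE.format(v="k"),  # hypothetical variant of fa
    "fc": "if(n <= 1) return 1 return fc(n - 1) * n",
}


def count_different_characters(left, right):
    """Similarity evaluation: number of positions at which the texts differ."""
    return sum(1 for a, b in zip_longest(left, right) if a != b)


def detect_clones(fragments, max_difference=10):
    """Pair all fragments and keep the pairs below the difference threshold."""
    clones = []
    for name_a, name_b in combinations(sorted(fragments), 2):  # pairing
        difference = count_different_characters(fragments[name_a], fragments[name_b])
        if difference < max_difference:  # clone decision
            clones.append((name_a, name_b))
    return clones


print(detect_clones(FRAGMENTS))
```

Such a textual criterion finds the renamed loop variable but not the recursive
variant, which is the gap that semantic clone detection addresses.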

The properties of the similarity metric split clones into two groups [35]. 
Type 1-3 clones capture textual similarity while Type 4 clones capture semantic 
similarity [2,23,24,34,35,44]. Type 1 clones (Exact Clones) are program fragments 
that are identical except for variations in white-space and comments. Type 2 
clones (Parameterized Clones) are program fragments that are structurally or 
syntactically similar except for changes in identifiers, literals, types, and comments. 
Type 3 clones (Near-Miss Clones) are program fragments that include insertions 
or deletions in addition to changes in identifiers, literals, types, and layouts. 
Type 4 clones (Semantic Clones) are program fragments that are functionally 
or semantically similar (i.e., perform the same computation) without textual 
similarities. These types are increasingly challenging to detect, with Type 4 being 
the most complex one. Note that the definition of Semantic Clones is often 
relaxed to allow up to 50% syntactic similarity of the code fragments 
(e.g., BigCloneBench [39]). However, we consider these clones as complex Type 3 
clones (additions, deletions, reordering) and not as semantic clones. This means 
that semantic clones in the context of this work are clones with no syntactic 
similarity except for chance similarities. 

We will use a ~ b to denote that a is a clone of b. Furthermore, a ≁ b denotes 
that a is not a clone of b. 


2.2 Programs & Code Elements 


PSM generalizes object-oriented terms to describe code elements in a program. 
Code elements are types $T$, properties $Pr$, and executables $Ex$ that refer to, 
e.g., classes, fields, and methods in Java [1], or classes, properties, and functions 
in Python [45]. Additional code elements are parameters $Pa$ and results $Re$ of 
executables that refer to parameters and return values of a method. Properties, 
parameters, and results are atomic code elements that have identifiable states at 
runtime. Types and executables are compositional elements that act as a collection 
of atomic elements. Types declare properties and executables, capturing structural 
relationships. Executables have behavioral relationships that are categorized into 
Inputs ($I$) and Outputs ($O$). Inputs are received parameters $Pa^{I}$, read 
properties $Pr^{I}$, and requested invocation results $Re^{I}$. Outputs are returned 
executable results $Re^{O}$, written properties $Pr^{O}$, and provided parameters 
$Pa^{O}$. We will denote atomic elements in lowercase, and compositional elements 
in bold-face lowercase, e.g., $n$ and $\mathbf{fa}$ in Listing 1.1. Executable results 
are named after their executables, e.g., $fa$ in Listing 1.1. 
$\mathbf{fc} = \{n^{Pa,I}, fc^{Re,I}, fc^{Re,O}\}$ denotes the code elements of 
Listing 1.3. For the sake of readability, we will omit the superscript classifiers 
where this is unambiguous, e.g., $\mathbf{fa} = \{n, fa\}$. The subset of inputs is 
denoted by $\mathbf{fc}^{I} = \{n^{Pa,I}, fc^{Re,I}\}$ and outputs by 
$\mathbf{fc}^{O} = \{fc^{Re,O}\}$. Finally, the set of all input and output 
combinations is given by 
$\mathbf{ex}^{IO} = \{(i, o) \in \mathbf{ex}^{I} \times \mathbf{ex}^{O}\}$. For 
example, $\mathbf{fd}^{IO} = \{(n, fd), (guard, fd)\}$ describes the IO pairs of fd(). 
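
The construction of IO pairs is a plain Cartesian product of an executable's
inputs and outputs. The short Python sketch below illustrates this; the list
encodings of Listings 1.1 and 1.4 and the function name io_pairs are assumptions
made for this example and are not part of PSM.

```python
from itertools import product


def io_pairs(inputs, outputs):
    """Return all (input, output) combinations of one executable."""
    return list(product(inputs, outputs))


# Hypothetical encodings of the executables from Listings 1.1 and 1.4.
fa_inputs, fa_outputs = ["n"], ["fa"]
fd_inputs, fd_outputs = ["n", "guard"], ["fd"]

print(io_pairs(fa_inputs, fa_outputs))  # [('n', 'fa')]
print(io_pairs(fd_inputs, fd_outputs))  # [('n', 'fd'), ('guard', 'fd')]
```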


2.3 Probabilistic Software Modeling 


Probabilistic Software Modeling (PSM) [40] is a data-driven modeling paradigm 
that transforms a program into a Probabilistic Model (PM). PSM extracts the 
structure and behavior of a program. Code elements and their dependency graph 
represent the structure as described in Section 2.2. All observable events at 
runtime represent the behavior. The resulting PM and its model elements are a 
probabilistic copy of the program. 

Model elements in the PM are the equivalent of code elements in the program. 
$P(x)$ denotes the probability distribution of variable $x$, e.g., $P_{fa}(n)$ 
denotes the probability distribution of input parameter n of the fa-method. $p(x)$ 
denotes the probability of a specific event of a variable, e.g., $p_{fa}(n = 2)$. 
This extends the notation of code elements with probabilistic quantities. However, 
the notation reasons about the probabilistic behavior of code elements instead of 
their structural properties. 

Each model element is a flow-based latent variable model [7] that learns an 
invertible mapping between the original observations and an isotropic unit norm 
Gaussian $\mathcal{N}(0, 1)$ with $f : X \leftrightarrow Z$. An example for 
$x \in X$ may be $n \in \mathbf{fa}$ with $n^{z} \in \mathbf{fa}^{z}$ being its 
latent Gaussian representation. The Gaussian latent space enables the model 
elements to generate new samples and evaluate the likelihood of samples. 

Generation (or Sampling) draws, either marginally or conditionally, observations 
from a model element simulating the execution of the corresponding code element. 
For example, drawing 100 observations from $fa \sim P_{fa}(n, fa)$, i.e., values 
for $n^{Pa}$ and $fa^{Re}$, simulates 100 program executions of this method. An 
example for conditional generation would be 
$fa_{|n<10} \sim P_{fa}(fa \mid n < 10)$ that only draws observations where 
n < 10. The process involves sampling from the latent Gaussian variables, and 
inverting the Gaussian samples to the original domain via the flow 
$f^{-1}(z) = x$. Evaluation takes observations and evaluates their likelihood 
under a model element. For example, $p_{fa}(n = 4, fa = 24)$ evaluates the 
likelihood of input 4 and output 24 under the fa model element. The process of 
evaluation involves mapping a given sample into the latent space and evaluating 
it under the Gaussians, $p_{\mathcal{N}(0,1)}(f(x))$. Generation and evaluation 
are the core of any PSM application and of SCD-PSM. A detailed description is 
given in our previous work [43]. 
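
Generation and evaluation can be illustrated with the simplest possible
invertible flow. The Python sketch below is a toy under stated assumptions: it
uses a one-dimensional affine map in place of the NVP flows used by PSM, and the
class AffineFlow with its method names is hypothetical. Evaluation uses the change
of variables formula log p(x) = log N(f(x); 0, 1) + log |df/dx|.

```python
import math
import random


class AffineFlow:
    """Toy invertible flow x = mu + sigma * z with z ~ N(0, 1)."""

    def __init__(self, mu, sigma):
        self.mu, self.sigma = mu, sigma

    def forward(self, x):
        """f: map an observation into the latent Gaussian space."""
        return (x - self.mu) / self.sigma

    def inverse(self, z):
        """f^-1: map a latent sample back to the original domain."""
        return self.mu + self.sigma * z

    def sample(self, count, rng):
        """Generation: draw latent Gaussian samples and invert them."""
        return [self.inverse(rng.gauss(0.0, 1.0)) for _ in range(count)]

    def log_likelihood(self, x):
        """Evaluation: latent log-density plus log |df/dx| = -log(sigma)."""
        z = self.forward(x)
        log_gauss = -0.5 * (z * z + math.log(2.0 * math.pi))
        return log_gauss - math.log(self.sigma)


flow = AffineFlow(mu=3.0, sigma=2.0)
observations = flow.sample(100, random.Random(0))  # simulates 100 executions
print(len(observations), flow.log_likelihood(3.0))
```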


3 Semantic Clones 


A clear understanding of what SCD-PSM defines as a semantic clone is essential for 
understanding the approach and its design choices. 


Definition 1. A semantic clone is a pair of executables whose (partial) input 
and output relationships exhibit significant (conditional) similarities. 


Definition 1 defines semantic clones over the similarity between IO relationships 
of executables. This holds even if the IO relationships are only partially similar, 
i.e., not all combinations of IO pairs between executables have to be similar. For 
example, fd in Listing 1.4 has two IO pairs 
($\mathbf{fd}^{IO} = \{(n, fd), (guard, fd)\}$) while fa in Listing 1.1 has one IO 
pair ($\mathbf{fa}^{IO} = \{(n, fa)\}$). According to the definition, at least one 
IO pair comparison needs to be similar for both executables to be declared a 
semantic clone (e.g., (n, fd) ~ (n, fa)). 
Furthermore, the similarities between IO pairs may only be conditional, i.e., 
the similarity of matching IO pairs might depend on the state of any other 
code element in the comparison context. For example, the IO pair (n, fd) ~ (n, fa) 
is only a perfect clone if fd.guard != "val". If fd.guard == "val", the IO 
behavior would differ for n = 0 (fd(0) → −1 while fa(0) → 1). 
According to the definition, at least parts of the behavior need to be similar, 
capturing complex multidimensional behavioral patterns in IO relationships. 

The rationale behind the comparison of IO relationships is one of cause and 
effect. If a pair of executables exhibits similar effects given similar causes, then 
their computational behavior is identical. Extending this rationale to multiple 
inputs and outputs leads to partial conditional similarity. 
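
Conditional similarity can be made concrete with direct Python transcriptions of
Listings 1.1, 1.3, and 1.4. This is a sketch under assumptions: the helper agree,
the guard value "none", and the use of ValueError for the thrown exception are
chosen for illustration. With the guard n < 1 from Listing 1.4, the behavior of
fd and fa diverges only at n = 0 and only for guard == "val".

```python
def fa(n):
    """Listing 1.1: for-loop factorial."""
    product = 1
    for i in range(1, n + 1):
        product *= i
    return product


def fc(n):
    """Listing 1.3: recursive factorial."""
    return 1 if n <= 1 else fc(n - 1) * n


def fd(n, guard):
    """Listing 1.4: delegate with a guard for invalid inputs."""
    if n < 1 and guard == "val":
        return -1
    if n < 1 and guard == "throw":
        raise ValueError("invalid input")
    return fc(n)


def agree(guard, inputs):
    """True if fd and fa produce equal outputs for all given inputs."""
    return all(fd(n, guard) == fa(n) for n in inputs)


print(agree("none", range(0, 10)))  # True: the guard is inactive
print(agree("val", range(0, 10)))   # False: fd(0) is -1 while fa(0) is 1
print(agree("val", range(1, 10)))   # True: the guard only affects n < 1
```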
Fig. 1: The modeling phase transforms the program into a PM. The search space 
phase then pairs the PM model elements into candidate pairs. Finally, Static-, 
Dynamic- and Model Similarity evaluate the behavioral equality of the candidates. 


Figure 1 illustrates SCD-PSM. It is a five-fold approach consisting of the 
following steps: 


A. [Modeling] PSM builds a probabilistic model that reflects the original 
program; 

B. [Search Space] A search space of candidate pairs is constructed by pairing 
executable model elements; 
C. [Static Similarity] The static similarity stage accepts candidate pairs with 
matching data types; 

D. [Dynamic Similarity] The dynamic similarity stage accepts candidate pairs 
with similar runtime data; 

E. [Model Similarity] The model similarity stage accepts candidate pairs with 
similar model behavior. 


The approach represents a rejecting filter pipeline that candidate pairs must 
traverse in order to be declared a clone. Static-, Dynamic-, and Model Similarity 
represent filter stages of increasing complexity. 
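
The rejecting filter pipeline can be sketched as an ordered list of predicates
that a candidate must pass. The Python sketch below is illustrative only: the
dictionary keys, the thresholds, and the function run_pipeline are assumptions,
and the stage predicates are stand-ins for the actual similarity stages.

```python
import math


def run_pipeline(candidate, stages):
    """Return (is_clone, name of the rejecting stage or None)."""
    for name, accepts in stages:
        if not accepts(candidate):
            return False, name  # later, more expensive stages never run
    return True, None


# Hypothetical stage predicates ordered from cheap to expensive.
STAGES = [
    ("static", lambda c: c["types_match"]),
    ("dynamic", lambda c: c["ks_p_value"] >= 0.01),
    ("model", lambda c: c["log_likelihood_ratio"] >= math.log(0.001)),
]

candidate = {"types_match": True, "ks_p_value": 0.5, "log_likelihood_ratio": -1.0}
print(run_pipeline(candidate, STAGES))  # (True, None)
```

Because a rejection returns immediately, a candidate that fails the static stage
needs no runtime data or model evaluation at all.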

The main contribution of this work is the implementation of a semantic clone 
detection pipeline on top of PSM. Further, we provide an effective process of 
traversing the potentially large search space of candidate pairs. Finally, we show 
that the behavioral equivalence of model elements generalizes to the semantic 
equivalence of code elements. 


4.1 Modeling 


Starting from the Source Code in Figure 1, PSM builds a Probabilistic Model 
(PM) [40] of the program (1). The PM is also called the Inference Graph (IG), 
which is a cluster graph [22] with Non-Volume Preserving Flows (NVPs) [7] as 
clusters. SCD-PSM uses this PM as a representation for the clone detection, 
similar to text-based clone detectors that use text fragments. The PM is the 
output of PSM and is considered as given in the context of SCD-PSM. 

Executable model elements in the PM act as a surrogate to the executables in 
the program. SCD-PSM pairs these model elements and computes their similarity. 
If a behaviorally equivalent model element pair is found, then it can be seen as a 
semantically equivalent code element pair. In conclusion, SCD-PSM allows 
for method-level semantic clone detection based on PMs representing the original 
executables in the program. 


4.2 Search Space 


[Figure 2: panels Program, BES, WES, and IO; labels: inputs, outputs, link]

Fig. 2: SCD-PSM operates on four levels of abstraction: program, between exe- 
cutable, within executable, and the IO level. 


SCD-PSM conducts method-level semantic clone detection, which operates 
on multiple abstraction levels. Figure 2 illustrates these levels, starting with the 
program and ending with the inputs and outputs of an executable. 
The second step in Figure 1 builds a within- and between-executable space 
that SCD-PSM searches for clones. The Between-Executable Space (BES) is the 
set of executable combinations 


$$BES = \{\{a, b\} \in Ex \times Ex \mid a \neq b\}, \tag{1}$$


where $\{a, b\}$ is a candidate pair (or executable pair), and $Ex$ is the set of 
all executables in the current analysis (illustrated in Figure 2). The theoretical 
size of the between-executable space is the number of 2-length combinations 
without replacement, given by 


$$|BES| = \frac{|Ex|!}{2! \, (|Ex| - 2)!}, \tag{2}$$


where $|\cdot|$ describes the size of the underlying set. Note that the size of the 
BES is smaller than the Cartesian product since $\{a, b\} = \{b, a\}$. Figure 1 
shows this pairing process in the Search Space aspect (2). The Within-Executable 
Space (WES) is the product of IO pairs 


$$WES^{ab} = \{(i, j) \in a^{IO} \times b^{IO}\}. \tag{3}$$


Figure 2 illustrates the WES and one IO pair from the WES that we also call 
link. The theoretical size of the within-executable space is 


$$|WES^{ab}| = |a^{IO}| \cdot |b^{IO}|. \tag{4}$$


For the sake of visualization, IO pairs are not shown in Figure 1 but are abstracted 
in their executable elements. The maximum theoretical search space is 


$$S = \sum_{i=1}^{|BES|} |wes(BES_i)|, \tag{5}$$


given that wes describes a construction function according to Equation (3), and 
$BES_i$ is the $i$th candidate pair. 
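
The sizes in Equations (2), (4), and (5) can be checked with a few lines of
Python. The function names and the example IO pairs are assumptions made for
this sketch; the count of 108 executables is the number of implementation
variants used in the study.

```python
from itertools import combinations
from math import comb


def between_executable_space(executables):
    """Equation (1): all unordered pairs of distinct executables."""
    return list(combinations(executables, 2))


def within_executable_space(io_a, io_b):
    """Equation (3): all combinations of IO pairs of two executables."""
    return [(i, j) for i in io_a for j in io_b]


def search_space_size(io_pairs_by_executable):
    """Equation (5): sum of the WES sizes over all candidate pairs."""
    return sum(
        len(within_executable_space(io_pairs_by_executable[a], io_pairs_by_executable[b]))
        for a, b in between_executable_space(sorted(io_pairs_by_executable))
    )


print(len(between_executable_space(range(108))), comb(108, 2))

# Hypothetical IO pairs for three of the factorial executables.
IO_PAIRS = {
    "fa": [("n", "fa")],
    "fb": [("n", "fb")],
    "fd": [("n", "fd"), ("guard", "fd")],
}
print(search_space_size(IO_PAIRS))
```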

In practice, SCD-PSM evaluates only a fraction of possible combinations 
because of the skip evaluation. The skip evaluation consists of two search space 
limiting factors: greedy evaluation and transitive similarity. Greedy evaluation 
stops the search through the WES once a similar pair is found. The initial 
detection process only confirms the similarity of a candidate pair. A post-analysis 
can then extract all possible IO similarities for potential actions. Transitive 
similarity skips evaluations in the BES because, if a ~ b and b ~ c, then a ~ c 
also holds. In conclusion, SCD-PSM compares IO pairs of executable model elements 
and uses skip evaluation to traverse the search space efficiently. 
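
One way to realize the skip evaluation is a union-find structure over clone
classes. The Python sketch below is not the implementation of SCD-PSM; the
function names, the executable g, and the oracle link_is_similar based on the
hypothetical labels in KIND are assumptions made for illustration.

```python
from itertools import combinations


def find(parent, x):
    """Return the representative of the clone class that contains x."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x


def detect_with_skip(executables, io_pairs, link_is_similar):
    parent = {e: e for e in executables}
    evaluated = skipped = 0
    for a, b in combinations(executables, 2):
        if find(parent, a) == find(parent, b):
            skipped += 1  # transitive similarity: a ~ b is already implied
            continue
        evaluated += 1
        # Greedy evaluation: any() stops at the first similar link.
        if any(link_is_similar(p, q) for p in io_pairs[a] for q in io_pairs[b]):
            parent[find(parent, a)] = find(parent, b)
    classes = {}
    for e in executables:
        classes.setdefault(find(parent, e), []).append(e)
    return sorted(classes.values()), evaluated, skipped


# Hypothetical ground truth that replaces the similarity stages.
KIND = {"fa": "factorial", "fb": "factorial", "fc": "factorial", "g": "other"}
IO = {name: [("n", name)] for name in KIND}
print(detect_with_skip(list(KIND), IO, lambda p, q: KIND[p[1]] == KIND[q[1]]))
```

In this example the pair (fb, fc) is never evaluated because fa ~ fb and fa ~ fc
were established first.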


4.3 Static Similarity 


The static similarity stage is a filter that accepts candidate pairs based on their 
data type, as shown in Figure 1. Data types in a PSM model are integers, floats, 
and text. 
The input (3) of the stage is the set of IO pairs $WES^{ab} = wes(\{a, b\})$ of a 
candidate. The filter criterion (4) accepts a candidate pair if at least one link 
(i.e., IO pair) has a matching data type, i.e., both the input and the output have 
a matching data type. The output (5) is a boolean decision whether the candidate 
pair is a clone or not from a static viewpoint. If positive, then the candidate pair 
is moved to the next pipeline stage, i.e., the Dynamic Similarity evaluation (see 
Figure 1). If negative, then the candidate pair is marked as not being a clone, 
a ≁ b, and no further processing is conducted. For example, the IO pairs 
(n, fa) ~ (n, fb) would be statically accepted as clones as both inputs and outputs 
have the same data type (integer). A counterexample is the link between (n, fa) 
and (guard, fd), where the input data types are integer and text. 

The static similarity stage requires that the analyzed program is given in a 
programming language that allows for static analysis. Programs written in 
programming languages without static typing cannot make use of this filter stage. 
In conclusion, the static similarity stage filters candidates based on their data type. 
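
The static filter criterion amounts to a type comparison per link. The Python
sketch below illustrates it; the TYPES table and the function names are
assumptions for this example, with the data types taken from the signatures in
Listings 1.1 to 1.4.

```python
# Hypothetical data types of the atomic code elements in Listings 1.1 to 1.4.
TYPES = {"n": "integer", "guard": "text", "fa": "integer", "fb": "integer", "fd": "integer"}


def link_types_match(link_a, link_b, types=TYPES):
    """A link matches if the input types and the output types are equal."""
    (in_a, out_a), (in_b, out_b) = link_a, link_b
    return types[in_a] == types[in_b] and types[out_a] == types[out_b]


def statically_similar(io_pairs_a, io_pairs_b, types=TYPES):
    """Accept a candidate pair if at least one link has matching types."""
    return any(link_types_match(p, q, types) for p in io_pairs_a for q in io_pairs_b)


print(link_types_match(("n", "fa"), ("n", "fb")))      # True: integer to integer
print(link_types_match(("n", "fa"), ("guard", "fd")))  # False: integer vs. text
print(statically_similar([("n", "fa")], [("n", "fd"), ("guard", "fd")]))  # True
```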


4.4 Dynamic Similarity 


The dynamic similarity stage is a filter that accepts candidate pairs based on 
the runtime data, as shown in Figure 1. Candidate pairs are accepted if at least 
one IO pair (6) has an insignificantly diverging runtime distribution (7). This 
boolean decision is evaluated via a Kolmogorov-Smirnov test [30], and determines 
whether a pair is a clone from a dynamic viewpoint (8). For example, the IO pair 
(n, fa) ~ (n, fd) with guard == "val" would be rejected by the filter given 
that runtime events with n = 0 reach a majority. In comparison, (n, fa) ~ (n, fb) 
would be accepted by the stage. 

A requirement is that the candidates use a synthetic trigger. Otherwise, the 
comparison of the data distributions may fail because the executables are 
exercised in different modes of operation. For example, running fa and fb where 
$n_{fa} \sim \mathcal{U}(0, 4)$ and $n_{fb} \sim \mathcal{U}(5, 10)$ would cause the 
dynamic stage to fail even if the implementations are equivalent. Property-based 
[12] or random testing can be used to generate diverse synthetic inputs. 

In conclusion, the dynamic similarity stage pre-filters candidates based on 
univariate tests on the input and output events. 
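
The effect of diverging triggers can be reproduced with a two-sample
Kolmogorov-Smirnov statistic. The Python sketch below is a simplified stand-in
for the test used in the stage: it compares the statistic against the common
large-sample critical value sqrt(-ln(alpha / 2) / 2) * sqrt((n + m) / (n * m)),
and the function names, sample sizes, and alpha are assumptions.

```python
import math
import random


def ks_statistic(xs, ys):
    """Largest distance between the two empirical distribution functions."""
    xs, ys = sorted(xs), sorted(ys)
    n, m = len(xs), len(ys)
    i = j = 0
    distance = 0.0
    while i < n and j < m:
        value = min(xs[i], ys[j])
        while i < n and xs[i] <= value:
            i += 1
        while j < m and ys[j] <= value:
            j += 1
        distance = max(distance, abs(i / n - j / m))
    return distance


def dynamically_similar(xs, ys, alpha=0.01):
    """Accept if the divergence stays below the asymptotic critical value."""
    n, m = len(xs), len(ys)
    critical = math.sqrt(-math.log(alpha / 2.0) / 2.0) * math.sqrt((n + m) / (n * m))
    return ks_statistic(xs, ys) <= critical


rng = random.Random(0)
n_fa = [rng.uniform(0, 4) for _ in range(50)]   # trigger of fa
n_fb = [rng.uniform(5, 10) for _ in range(50)]  # diverging trigger of fb
print(dynamically_similar(n_fa, n_fb))  # False: the supports do not overlap
print(dynamically_similar(n_fa, n_fa))  # True: identical synthetic trigger
```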


4.5 Model Similarity 


The model similarity stage is a filter that accepts candidate pairs based on 
the models, as shown in Figure 1. This stage conducts a multivariate test by 
sampling from the executable models and cross evaluating them. This test 
includes the evaluation of conditional influences caused by elements that are not 
actively participating in an IO pair. For example, (n, fd) ~ (n, fa) holds but is 
conditionally dependent on guard. The model similarity can factor guard into 
its decision while the dynamic stage can only evaluate the average behavior of 
an IO pair. 

The input (9) is the set of IO pairs of a candidate, $WES^{ab} = wes(\{a, b\})$. 
The cross-wise log-likelihood ratio of the models is computed by (conditional) 
generation and evaluation. The output is a boolean decision on whether the 
candidate pair is a clone or not, from a model viewpoint. Figure 1 illustrates the 
entire process of the model similarity. 


(A) A reference model $M^{null} = a$ and an alternative model $M^{alt} = b$ are 
    selected. 
(B) An IO pair $p \in WES^{ab}$ is selected as the target of the comparison (link). 
(C) A reference sample $D^{null}$ is drawn from $M^{null}$ (10). 
(D) An alternative sample $D^{alt|null}$ is drawn from $M^{alt}$ by optimizing 
    towards the $p$ dimensions in $D^{null}$, effectively conditioning the drawn 
    samples (11). 
(E) $D^{null}$ is evaluated under $M^{null}$, resulting in the reference 
    log-likelihood $LL^{null}$. 
(F) $D^{alt|null}$ is evaluated under $M^{alt}$ (12), yielding the alternative 
    log-likelihood $LL^{alt}$. 
(G) Finally, the likelihood ratio of the link is computed by 
    $\lambda = LL^{alt} - LL^{null}$. 

The roles of the null and alt models are then swapped, and the process is 
repeated. Both log-likelihood ratios are then combined by a pooling operator to 
produce the clone decision (14). 

The role-swap is needed to avoid sub-model relationships. For example, if 
$M^{null} = \mathcal{N}(0, 3)$ and $M^{alt} = \mathcal{N}(0, 1)$ then $LL^{alt}$ 
will be very high because $M^{alt}$ is a sub-model of $M^{null}$. Reversing the 
roles highlights the differences in the models. 

The final decision is based on the Generalized Likelihood Ratio Test (GLRT) 
[10]. It measures whether the log-likelihood ratios are significantly different 
from 0, where $\lambda$ is the test statistic. The null hypothesis is rejected for 
small ratios $\lambda < \log c$, where $c$ is set to an appropriate false-positive 
rate. For example, $\lambda < \log(0.01)$ allows 1 out of 100 candidates to be a 
false-positive, i.e., wrongly rejecting semantic equivalence. The pooling operator 
combines the link results either via hard or soft pooling. Hard pooling conducts 
a GLRT for both links, yielding a positive decision if both links are positive. 
Soft pooling averages the link log-likelihood ratios and then computes the GLRT, 
yielding a positive decision if the joint GLRT is positive. Hard pooling does not 
allow any sub-model relationships, while soft pooling relaxes this constraint. 

In conclusion, the model similarity conducts a multivariate significance test 
between two models, including possible conditional dependencies. 
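
The role swap and the pooling operator can be sketched with one-dimensional
Gaussians in place of flow models. The Python sketch below is a deliberate
simplification and not the procedure of SCD-PSM: the reference sample is
cross-evaluated directly under the alternative model instead of drawing a
conditioned alternative sample, and all function names, the seed, and the
summation over the sample are assumptions.

```python
import math
import random


def gaussian_log_pdf(x, mu, sigma):
    return -0.5 * math.log(2.0 * math.pi) - math.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2


def link_log_ratio(null, alt, sample_size, rng):
    """lambda = LL_alt - LL_null on a sample drawn from the reference model."""
    sample = [rng.gauss(*null) for _ in range(sample_size)]
    ll_null = sum(gaussian_log_pdf(x, *null) for x in sample)
    ll_alt = sum(gaussian_log_pdf(x, *alt) for x in sample)
    return ll_alt - ll_null


def is_clone(model_a, model_b, c=0.01, pooling="soft", sample_size=50, seed=0):
    """Models are (mu, sigma) tuples; similarity is rejected if lambda < log c."""
    rng = random.Random(seed)
    ratio_ab = link_log_ratio(model_a, model_b, sample_size, rng)
    ratio_ba = link_log_ratio(model_b, model_a, sample_size, rng)  # role swap
    threshold = math.log(c)
    if pooling == "hard":
        return ratio_ab >= threshold and ratio_ba >= threshold
    return (ratio_ab + ratio_ba) / 2.0 >= threshold


print(is_clone((0.0, 1.0), (0.0, 1.0), pooling="hard"))  # True: identical models
print(is_clone((0.0, 1.0), (5.0, 1.0), pooling="soft"))  # False: shifted model
```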


5 Study 


This study answers the following research questions. 


Q1 Does behavioral equality between model elements generalize to semantic 
equality of code elements? 

Q2 Does the skip evaluation significantly reduce the computational demand of 
SCD-PSM? 

Q3 Does the skip evaluation negatively impact the detection performance (i.e., 
precision, recall, and MCC)? 
Q1 answers whether semantic clones can be detected via SCD-PSM. Q2 answers 
whether the search space can be efficiently processed using skip evaluation. Q3 
answers how the skip evaluation influences the performance of the detection 
process. This is important because candidate pairs might be skipped based on 
false-positives or false-negatives. 


5.1 Setup 


We implemented a prototype for SCD-PSM on top of Gradient [40], a prototype 
for PSM. The elements and data flow of the detection process are shown in 
Figures 1 and 2. 


1. The input Source Code consisted of 13 different clone classes with a total of 
108 implementation variants. This includes classical algorithms implemented 
recursively and iteratively, such as bubble sort, as well as hard problems from 
the programming competition Google Code Jam¹. 

2. The Probabilistic Model was computed via Gradient, a PSM prototype. We 
used the same hyper-parameters as reported in our previous work [43]. 

3. The Search Space, i.e., the BES and WES, was created according to Section 4.2 
based on all available examples. 

4. Each valid candidate pair was then submitted to the Static-, Dynamic-, 
and Model-Similarity stages and filtered according to Sections 4.3 to 4.5. 
Candidates that passed the entire filter pipeline were marked as clones. 


5.2 Dataset 


The study uses three well-known algorithms and 10 Google Code Jam 2017 
(GCJ)¹ problems. The total dataset contains 108 implementation variants across 
13 clone classes described by Instance. 

Each clone class was differentially tested to verify the behavior across in- 
stances. Factorial, Fibonacci, and Sort do not need any further explanation. The 
GCJ problems are well specified complex optimization problems packaged in an 
everyday theme. 

The dataset contains in total 5778 (see Equation (2)) candidate pairs of which 
458 are semantic clones and 5320 are not. This yields a positive to negative ratio 
of 1 : 11.6, indicating a highly imbalanced distribution. An even more pronounced 
imbalance is to be expected in real-world applications. 

Each instance was triggered with input data to allow PSM to model the different 
implementations. Factorial, Fibonacci, and Sort were triggered by sampling 
from a uniform distribution $\mathcal{U}(0, 20)$. GCJ problems were triggered by 
the input data provided by the competition. Each instance received the same trigger. 

GCJ problems read from and write to the standard stream, which is impractical 
in terms of reproducibility. Our dataset is constructed such that each 
implementation has a run-method representing the cloned executable. The study 
results are limited to the run-method even if the solutions use helper methods. 
Helper methods may, for example, be methods that compute parts of the final 
solution, or reorganize the data. This guarantees a proper problem scope, a 
well-defined recall and precision, and a clearly defined benchmark for future 
reproducibility. 


¹ https://codingcompetitions.withgoogle.com/codejam/archive 


5.3 Controlled Variables 


The study controls for the search space Evaluation strategy, Dynamic False- 
Positive Rate (D-FPR), Model False-Positive Rate (M-FPR), and Pooling. 


Evaluation describes how the search space is processed: exhaustive, or skip. The 
exhaustive evaluation compares each executable candidate with each other. 
The skip evaluation uses the transitive similarity (see Section 4.2) and may 
skip evaluations if possible. 

Dynamic False-Positive Rate (D-FPR) defines the critical value α of the 
Kolmogorov-Smirnov test with 0.01 and 0.1, at which similarity is rejected. 

Model False-Positive Rate (M-FPR) defines the critical value c of the Gen- 
eralized Likelihood Ratio test with 0.001 and 0.01, at which similarity is 
rejected. 

Pooling defines how the likelihood ratios from the two link directions are 
combined (see Figure 1, (8)) with values: hard, or soft. Hard pooling evaluates 
whether each link reaches the critical value c and accepts the clone if both 
links evaluate as positive with $\lambda_{Link_a} \geq \log c$ and 
$\lambda_{Link_b} \geq \log c$. Soft pooling evaluates the average log-likelihood 
ratio (geometric mean of likelihoods), 
$\frac{\lambda_{Link_a} + \lambda_{Link_b}}{2} \geq \log c$, and compares it 
against the critical value c. 


An additional fixed parameter is the number of particles. It defines the sample 
size |D| = 50 that is generated during the model similarity stage. 


5.4 Response Variables 


The response measures of the study are the number of Skip Evaluations, processing 
Duration, TP, FP, TN, FN, Precision, Recall, F1, and Matthews Correlation 
Coefficient. 


Skip Evaluations measures the number of evaluations that were skipped due 
to the skip evaluation strategy. 

Duration measures the elapsed time to compute one candidate pair. 

TP, FP, TN, FN measure the True Positive (TP), False Positive (FP), True 
Negative (TN), and False Negative (FN) detection results compared to the 
ground truth. 

Precision measures the fraction of detected clones that are truly clones. 

Recall measures the fraction of semantic clones that have been found. 

F1 measures the accuracy of a binary classification as the harmonic mean of 
recall and precision. 
Table 1: Results of the top-5 and bottom-1 experiment along with the average 
performance of the top-5. 


     Controlled Variables              | Response Variables
Nr   Evaluation  D-FPR  M-FPR  Pooling | Duration  TP   FP  TN    FN   Skip  Precision  Recall  F1     MCC
1    skip        0.100  0.001  soft    | 560       437  0   5320  21   345   1.000      0.954   0.977  0.975
2    skip        0.010  0.001  soft    | 620       437  0   5320  21   345   1.000      0.954   0.977  0.975
3    exhaustive  0.010  0.001  soft    | 680       425  0   5320  33   0     1.000      0.928   0.963  0.960
4    skip        0.010  0.010  soft    | 920       423  0   5320  35   332   1.000      0.924   0.960  0.958
5    exhaustive  0.100  0.001  soft    | 2040      421  0   5320  37   0     1.000      0.919   0.958  0.955
16   exhaustive  0.100  0.010  hard    | 2820      293  0   5320  165  0     1.000      0.639   0.780  0.787
1-5  skip        0.010  0.001  soft    | 1740      428  0   5320  29   340   1.000      0.936   0.967  0.965


Duration in seconds 


Matthews Correlation Coefficient (MCC) measures the quality of the clone 
detection in the form of a correlation ranging from —1 to 1, with 0 being a 
random selection. The MCC will be the reference performance metric as it is 
the most robust metric in an imbalanced binary classification setting [3]. It is 
a correlation coefficient which may be interpreted by the guidelines proposed 
by Evans [9]. 
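
The response measures follow directly from the confusion counts. The Python
sketch below uses the standard textbook definitions; the function name and the
handling of undefined values via NaN are assumptions, and the counts in the
usage line are those of experiment Nr. 1 in Table 1.

```python
import math


def detection_metrics(tp, fp, tn, fn):
    """Return precision, recall, F1, and MCC; undefined values become NaN."""
    nan = float("nan")
    precision = tp / (tp + fp) if tp + fp else nan
    recall = tp / (tp + fn) if tp + fn else nan
    f1 = 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else nan
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else nan
    return precision, recall, f1, mcc


precision, recall, f1, mcc = detection_metrics(tp=437, fp=0, tn=5320, fn=21)
print(round(precision, 3), round(recall, 3), round(f1, 3), round(mcc, 3))
```

The MCC stays informative under the strong class imbalance of the dataset
because it takes all four confusion counts into account.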


5.5 Comparison of Clone Detectors 


In total, eight alternative approaches are used to contextualize the performance of 
SCD-PSM. The alternatives have a wide variety in terms of internal representation 
and clone detection capabilities as listed in Table 3. ASTNN (8) and ASTNN 
Leaky (9) are the same approach but have different evaluation methods. ASTNN 
Leaky (9) uses a random split of the dataset as reported by the authors [50]. It 
overestimates the performance of the approach via a lack of isolation between 
training and test dataset. For example, fa ~ fb and fa ~ fc might be in the 
train split while fb ~ fc might be in the test split. ASTNN (8) uses a group-wise 
Cross Validation (CV), where clone classes are entirely isolated either into the 
training or test proportion of the dataset. This represents a real-world situation 
where first the detector is fitted and then applied to a new system with unknown 
code fragments. 
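
The difference between a random split and a group-wise split can be sketched as
follows. This Python sketch only illustrates the isolation property; the
instance names, the round-robin fold assignment, and the function name are
assumptions and do not reproduce the ASTNN evaluation setup.

```python
def group_wise_folds(instances, fold_count):
    """Assign whole clone classes to folds so that no class spans two folds."""
    classes = sorted({clone_class for _, clone_class in instances})
    folds = [[] for _ in range(fold_count)]
    for index, clone_class in enumerate(classes):
        members = [name for name, c in instances if c == clone_class]
        folds[index % fold_count].extend(members)
    return folds


# Hypothetical instances as (name, clone class) tuples.
INSTANCES = [
    ("fa", "factorial"), ("fb", "factorial"), ("fc", "factorial"),
    ("s1", "sort"), ("s2", "sort"), ("g1", "fibonacci"),
]
print(group_wise_folds(INSTANCES, 3))
```

With this assignment, fa ~ fb can never appear in the training split while
fb ~ fc appears in the test split.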

Detectors that report lines instead of methods may produce more results (TP, 
FP, TN, FN) than present in the dataset. A similar situation is given by ASTNN 
Leaky that runs multiple evaluations via the cross validation. 


5.6 Experiment Results 


Creating the PSM model with Gradient took 2134.38 s, resulting in an average 
modeling time of 19.75s for the 195 executables. This includes 87 helper methods. 

Table 1 contains the aggregate results of the top-5 experiments along with 
the results of the worst experiment. The bottom line in Table 1 is the average 
Table 2: Performance breakdown of the best performing experiment listed as 
Nr. 1 in Table 1. 

Stage    Duration  TP   FP    TN    FN  Precision  Recall  F1     MCC
initial  -         458  5320  0     0   0.079      1.000   0.147  -
static   0.0001    458  1504  3816  0   0.233      1.000   0.379  0.409
dynamic  0.208     451  50    5270  7   0.900      0.985   0.941  0.936
model    1.749     437  0     5320  21  1.000      0.954   0.977  0.975

         0.344     437  0     5320  21  0.996      0.954   0.977  0.975


Duration in seconds 


performance of the top-5 experiments. The generally expected performance of 
the approach is very strong with an MCC of 0.965. High confidence for negative 
examples is given with no false-positives, reflecting the pipeline's false-positive 
rates (D-FPR × M-FPR). The best experiment featured a skip evaluation, a D-FPR 
of 0.100, an M-FPR of 0.001, and soft pooling (Nr. 1) with an MCC of 0.975. 
The worst experiment featured an exhaustive evaluation, a D-FPR of 0.100, an 
M-FPR of 0.010, and hard pooling (Nr. 16) with a strong MCC of 0.787. A total of 
345 candidates were skipped while reaching a recall of 0.933. 


Table 2 lists the cumulative performance of the best model, starting with 
an initial prediction that all candidates are semantic clones (rejecting pipeline). 
The static stage finds 71.729 % (3816) of the FPs, raising the MCC to 0.409. 
The dynamic stage removes another 27.330 % (1454) of the FPs but 
introduces 1.528 % (7) of the possible FNs. An improvement of the MCC by 
0.527 is achieved via the dynamic stage. Finally, the model stage removes the 
remaining 0.939 % (50) of the FPs but introduces an additional 3.056 % (14) of 
the possible FNs. The model stage improves the MCC by 0.039. 
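
The stage-wise shares can be recomputed from the counts in Table 2. The Python
sketch below is a plain arithmetic check; the dictionary layout and variable
names are assumptions, and the counts are the differences between consecutive
rows of Table 2.

```python
NEGATIVES, POSITIVES = 5320, 458  # non-clone and clone candidate pairs

# FPs removed and FNs introduced per stage (row differences in Table 2).
REMOVED_FP = {"static": 3816, "dynamic": 1454, "model": 50}
INTRODUCED_FN = {"static": 0, "dynamic": 7, "model": 14}

for stage in REMOVED_FP:
    fp_share = 100.0 * REMOVED_FP[stage] / NEGATIVES
    fn_share = 100.0 * INTRODUCED_FN[stage] / POSITIVES
    print(f"{stage}: removes {fp_share:.3f} % of FPs, introduces {fn_share:.3f} % of FNs")
```

Summed over the three stages, all 5320 false positives are removed at the cost
of 21 false negatives, which matches the final row of the best experiment.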


On average, 5.884% (340) of the total 5778 evaluations could be skipped. 
This equals 74.235 % of the total 458 TPs. On average 37.359 % (50354) of the 
total 134782 IO pair evaluations could be saved via greedy evaluation. The 
average duration of the exhaustive experiments was 2394s, leading to 414ms per 
candidate. Skip experiments lasted on average 1988s with 344 ms per candidate. 
The static stage accounted on average for <0.1 % of the time per candidate (see 
Table 2), the dynamic stage for 10.6 %, and the model stage for 89.3 %. 


Table 3 lists the detection results of eight alternative clone detectors. Simian, 
NiCad, and CCAligner found no clones in the dataset. PMD, SourcererCC, Oreo, 
and iClones found some clones (< 20) with a low recall (4%). Each of these 
detectors has a very weak performance below an MCC of 0.20. ASTNN with the 
leaky evaluation has a very strong performance with an MCC of 0.976. ASTNN 
3-Group CV has a strong performance with an MCC of 0.711. The longest 
computational duration is given by ASTNN with 1034 min. 
Table 3: Detection results of other clone detectors on the dataset. 


Nr  Tool              Note          Repr.  Type | Duration  TP   FP  TN    FN   Precision  Recall  F1     MCC
1   Simian [16]       -             Text   1    | 0.188     0    0   5320  458  -          0.000   -      -
2   NiCad [5]         -             Text   3    | 1.291     0    0   5320  458  -          0.000   -      -
3   CCAligner [47]    -             Token  3    | 1.109     0    4   5316  458  0.000      0.000   -      -0.007
4   PMD [33]          -             Token  2    | 1.389     8    12  5308  450  0.400      0.017   0.033  0.069
5   SourcererCC [37]  -             Token  3/4  | 36.86     10   0   5320  448  1.000      0.021   0.042  0.142
6   Oreo [36]         -             Model  3/4  | 79.00     17   5   5315  441  0.772      0.037   0.070  0.158
7   iClones [15]      -             Token  3/4  | 0.980     13   0   5320  445  1.000      0.028   0.055  0.161
8   ASTNN [50]        3-Group CV    Model  4    | 1034      296  29  1415  162  0.911      0.646   0.756  0.711
9   ASTNN (Leaky)     Random Split  Model  4    | 2028      442  4   5316  16   0.991      0.965   0.978  0.976
10  SCD-PSM           Top 1-5       Model  4    | 1740      428  0   5320  29   1.000      0.936   0.967  0.965


Duration in seconds 


6 Discussion 


The goal of the study was to provide evidence of whether behavioral equality of 
model elements generalizes to semantic equality of code elements (Q1). 
Furthermore, we were interested in the skip evaluation and its performance 
implications (Q2 and Q3). 


6.1 Research Question 1 — Detection Performance 


Table 1 and Table 2 present strong results in favor of Q1. The MCC for the top-5 
experiments was very strong with all MCCs being above 0.9. Even the worst 
experiment still yielded a strong performance with an MCC of 0.787. 

Table 3 provides additional context to the results by presenting the detection 
results of alternative clone detectors. As expected, tools relying heavily on the 
textual representation of clones have very low recall (Simian, NiCad, CCAligner, 
PMD) on the dataset. Most clones found by the alternative tools span only a few 
lines of code. In contrast, iClones finds large clones that include array accesses 
and manipulations. ASTNN is the best comparison tool and finds many clones 
with good precision. The approach is sensitive to hyper-parameters and to the 
training and test split, leading in some cases to a test performance close to 
MCC of 0. The low recall for Type 1-3 detectors indicates the high quality of the 
dataset. The moderate recall for Type 3/4 detectors indicates the high quality of 
SCD-PSM. Given this evidence, we conclude that Q1 holds. 


Q1 — Behavioral equality between model elements generalizes to se- 
mantic equality of code elements, allowing for semantic clone detection via 
probabilistic software modeling. 
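
The metrics discussed above can be recomputed from the confusion-matrix counts of a detector. The following is a minimal sketch, not part of the original evaluation tooling. It assumes that the four counts in each row of Table 3 are TP, FP, TN, and FN in that order (the column header is not reproduced here), and the function name `detection_metrics` is hypothetical.

```python
from math import sqrt


def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute precision, recall, F1, and MCC from confusion-matrix counts.

    Undefined ratios (zero denominators) are reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}


# Counts as read from the SCD-PSM and PMD rows (assumed column order).
scd_psm = detection_metrics(tp=428, fp=0, tn=5320, fn=29)
pmd = detection_metrics(tp=8, fp=12, tn=5308, fn=450)
```

Under this reading, the recomputed values agree with the reported ones up to the last printed digit, which suggests that the table truncates rather than rounds.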
6.2 Research Question 2 — Skip Evaluation Scalability


The goal of the static and dynamic stage is to reduce the number of evaluations 
that the model stage must conduct. Each stage incurs an increasing cost of 
evaluation per candidate, with the model stage taking the largest share of the 
evaluation time, 89%. Every TP has to pass the model stage to be declared 
a clone (rejecting pipeline). The skip evaluation avoided, on average, the re- 
computation of 74% (340) of the TP candidate pairs. The greedy evaluation 
avoided, on average, the evaluation of 37% of IO pairs. This offloads most of 
the evaluation time to the earlier stages, which are computationally inexpensive, 
while shortcutting the model stage. In comparison to the alternative detectors,
SCD-PSM needs substantially more time to compute (29 min vs. at most 1.32 min). An
exception is ASTNN, which has a runtime similar to that of SCD-PSM. Most of the
runtime of SCD-PSM is caused by the operational overhead, e.g., loading the 
model from the database. Optimizing this overhead, as a theoretical maximum, 
could reduce the overall runtime on the dataset to 6.49 min given the average 
durations for each stage in Table 2. In conclusion, the skip evaluation reduces the 
number of model evaluations, which are responsible for most of the evaluation 
time, down to a quarter. 


Q2 — Skip evaluation reduces the number of evaluations for the most 
expensive stage (model) in the SCD-PSM pipeline significantly. 
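
The effect of skipping can be illustrated with a small sketch. It assumes that skip evaluation exploits the transitivity of clone classes, i.e., a pair is not re-evaluated when both elements already belong to the same class; the exact mechanism of SCD-PSM is defined in earlier sections and may differ. The names `detect_with_skip` and `is_clone` are hypothetical, and `is_clone` stands in for the expensive model stage.

```python
from itertools import combinations


def detect_with_skip(elements, is_clone):
    """Group elements into clone classes, skipping pairs already in one class.

    `is_clone(a, b)` is a placeholder for the expensive model-stage check.
    Returns (clone_classes, evaluated_pairs, skipped_pairs).
    """
    parent = {e: e for e in elements}

    def find(e):
        # Union-find lookup with path halving.
        while parent[e] != e:
            parent[e] = parent[parent[e]]
            e = parent[e]
        return e

    evaluated = skipped = 0
    for a, b in combinations(elements, 2):
        if find(a) == find(b):
            skipped += 1  # already known to be clones via transitivity
            continue
        evaluated += 1
        if is_clone(a, b):
            parent[find(a)] = find(b)

    classes = {}
    for e in elements:
        classes.setdefault(find(e), set()).add(e)
    return list(classes.values()), evaluated, skipped


# Toy data: four elements share one behavior, one element differs.
behavior = {"a": "sum", "b": "sum", "c": "sum", "d": "sum", "e": "max"}
classes, evaluated, skipped = detect_with_skip(
    sorted(behavior), lambda x, y: behavior[x] == behavior[y]
)
```

In this toy run, 3 of the 10 candidate pairs never reach the expensive check. The sketch also shows why a single false positive is risky: one wrong merge joins two classes, and all later pairs across them are skipped without evaluation.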


6.3 Research Question 3 — Skip Evaluation Effects 


Skip evaluation can cause cascading errors given an FP. Once an FP is introduced, 
every semantic clone related to the FP has a chance to become an FP in the same 
(wrong) clone class itself. These cascading FPs are potential sources of serious 
performance degradation. Skip evaluation experiments are ranked higher and 
are significantly better than experiments that conducted an exhaustive search. 
However, the absolute performance gain is only an MCC of 0.056, which hints that
the significance may be due to chance given the small sample size (16 experiments).
Nevertheless, given the evidence in Table 1 and Section 5.6, we can conclude that 
skip evaluation does not affect the performance of the detector. 


Q3 — The skip evaluation has no negative impact on the performance of 
the detector given low false-positive rates. 


7 Limitations 


SCD-PSM inherits the limitations of PSM, such as its need for a runnable program 
to build the model. PSM only models the application structure and its data, not 
references. References are changing addresses with no relation to the running 
program. Hence, they have no meaningful underlying distribution that can be
modeled. However, once references are dereferenced, e.g., by accessing a field, 
their accessed data will be part of the model and therefore usable in SCD-PSM. 
Nevertheless, algorithms with the sole purpose of manipulating references do not 
work with SCD-PSM. 

PSM explodes lists into singular values, since distributions do not contain any 
order information. This means executables that change the order of sequences 
are matched based on the values, not their order. As a consequence, an ascending
and a descending sorting algorithm are considered semantically equivalent, leading to a
false positive. Extending PSM to distributions of sequences would alleviate the issue but is
not a trivial task.
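
The loss of order information can be made concrete with a short sketch. It does not use PSM itself; a `Counter` over the output values serves as a simplified stand-in for an order-free distribution of values.

```python
from collections import Counter

data = [3, 1, 2, 3, 5]

ascending = sorted(data)
descending = sorted(data, reverse=True)

# Compared as multisets of values, the two outputs are indistinguishable ...
same_values = Counter(ascending) == Counter(descending)

# ... although the sequences themselves differ.
same_order = ascending == descending
```

Any comparison that only sees the values, and not their positions, therefore cannot separate the two sorting directions.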

SCD-PSM cannot detect Type 2-3 clones since textual similarities represent 
a different problem set. A proof can easily be constructed by adding an arbitrary
number of statements that do not influence the behavior of the program but mislead
text-based detectors. Inversely, changing one character, e.g., a multiplication
to a division, may alter the entire behavior while preserving the general textual
similarity.
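
The second argument can be illustrated as follows. The sketch uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for a textual similarity measure; it is not one of the detectors evaluated in this study, and the helper `load` is hypothetical.

```python
from difflib import SequenceMatcher

src_mul = "def f(a, b): return a * b"
src_div = "def f(a, b): return a / b"

# Character-level similarity of the two sources (1.0 means identical).
similarity = SequenceMatcher(None, src_mul, src_div).ratio()


def load(source: str):
    """Compile a source string and return the function `f` it defines."""
    namespace = {}
    exec(source, namespace)
    return namespace["f"]


f_mul, f_div = load(src_mul), load(src_div)
result_mul = f_mul(6, 3)  # multiplication
result_div = f_div(6, 3)  # division
```

The two sources differ in a single character and are therefore almost identical as text, yet they compute different results for the same input.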

We employed a controlled laboratory evaluation strategy that allowed us 
to exactly evaluate the performance metrics and fairly compare them between 
different clone detectors. This follows a recent trend [38,46,48] in the light of 
some criticism of opportunistic evaluations on arbitrary open source projects. 
The controlled laboratory evaluation provides purely functional performance 
results given a fixed and controlled sample of programs. The generalizability of 
results obtained from laboratory evaluations is limited; using an opportunistic
evaluation strategy avoids this problem. However, the strategy is prone to biases 
caused by the human oracles (often the authors themselves) or proxy oracles 
that evaluate the clones. Moreover, a fair comparison between detectors is hardly 
possible because the true recall of clones is in general unknown. A combination 
of both evaluation strategies may yield precise and generalizable results. The 
extension to this study is part of our future work. 


8 Threats to Validity 


A threat to validity in any semantic clone detection study is given by the programs 
and code fragments used in the evaluation. Semantic clones may not exhibit 
the same functional behavior or share too many lexicographical similarities. 
This study tested every clone class on its behavioral equality. Furthermore, we 
evaluated text-, token-, graph- and model-based detectors capable of detecting 
Type 1-3 clones. The low performance of Type 1-3 detectors confirmed the high 
quality of semantic clones in the benchmark. 


9 Related Work 


We started this article by defining what a semantic clone means in the context of
our approach (Section 3). While our definition is motivated by the capabilities
of our approach, we can see strong similarities to the definition of Juergens
[19]. Both definitions define behavioral similarity via IO relationships. Also,
Juergens already discussed a notion of partial and conditional similarity. This
understanding of Type 4 clones can be seen in multiple more recent studies
[8,6,27]. In that, we see the progress of the community in terms of Type 4 clones
as the definition becomes more specific.

Many studies evaluated textual clones. However, only a few studies have 
reported results on semantic clones without relaxing the definition of Type 4. 
Rattan et al. [34] provided a review of clone detection studies including approaches
focused on Type 4 clones. They concluded that some approaches solve
approximations (i.e., complex Type 3 clones) of Type 4 clones.

Test-based methods randomly trigger the execution of candidates and measure 
whether equal inputs cause similar outputs. Jiang and Su [18] were able to find 
semantically equivalent methods without any syntactic similarities. A similar 
approach was presented by Deissenboeck et al. [6]. One issue with test-based clone 
detection is that candidates need a similar signature. Differences in data types
or the number of parameters cannot be handled effectively. SCD-PSM works
similarly to test-based methods in that it observes the runtime and compares
the resulting behavior. However, SCD-PSM builds generative models from the
observed behavior, capable of generating, conditioning, and evaluating data.
This allows SCD-PSM to bridge signature mismatches by imputing missing code
elements and by using a generalized type system.
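
The principle of test-based detection described above can be sketched in a few lines. This is an illustration of the general idea only, not the implementation of [18] or [6]; the function `behaves_alike`, the input generator, and the candidate functions are hypothetical.

```python
import random


def behaves_alike(f, g, trials: int = 200, seed: int = 0) -> bool:
    """Feed identical random integer lists to f and g and compare the outputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-50, 50) for _ in range(rng.randint(1, 10))]
        if f(list(xs)) != g(list(xs)):
            return False
    return True


def sum_loop(xs):
    total = 0
    for x in xs:
        total += x
    return total


def sum_builtin(xs):
    return sum(xs)


def max_value(xs):
    return max(xs)


same = behaves_alike(sum_loop, sum_builtin)
different = behaves_alike(sum_loop, max_value)
```

The sketch also exposes the limitation named above: both candidates must accept the same input (here, a single list of integers), otherwise the comparison cannot even be set up.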

Zhao and Huang [51] developed DeepSim, which phrases the problem as a 
binary classification task. DeepSim uses neural networks to learn encodings of 
the control and data flow without observing the program’s runtime. PSM also 
uses neural networks but learns an underlying representation of the data flow 
and runtime. DeepSim was also evaluated on a Google Code Jam dataset. It 
reached an F1 score of 0.76 on the GCJ 2016 competition, while SCD-PSM 
reached 0.967 on the GCJ 2017. While not entirely comparable, the results are a 
good approximation given the similarity in the datasets. 


10 Conclusions and Future Work 


In this article, we presented Semantic Clone Detection via Probabilistic Software 
Modeling (SCD-PSM). PSM builds a Probabilistic Model (PM) from a program 
that can be used to simulate or evaluate a program. We used these PMs to detect 
semantic clones in programs that have 0% syntactic similarity. 

We discussed the representation, the search space, and the static-, dynamic-, and
model-similarity stages that form the main aspects of SCD-PSM. The study evaluated
SCD-PSM in great detail, resulting in an average MCC greater than 0.9. Also, the
study showed the capability to control the false-positive rate, which is important
for industry adoption. Finally, we concluded that behavioral equality of model
elements generalizes to semantic equality of code elements.

Our future work focuses on constructing a comprehensive benchmark covering 
controlled and real-world systems for improved generalizability of clone detection 
studies. Furthermore, semantic clone detection has the potential to enable new 
methods for fault localization applications [42]. 
Abstract. Verifying whether a UML class diagram annotated with Object
Constraint Language (OCL) constraints is consistent involves finding
valid instances that provably meet its structural and OCL constraints. 
Recently, many tools and techniques have been proposed to find valid in- 
stances. However, they often do not scale well when the number of OCL 
constraints significantly increases. In this paper, we present a new tool 
called QMaxUSE that is capable of automatically verifying a large num- 
ber of OCL invariants. QMaxUSE works by decomposing them into a 
set of different queries. It then uses an SMT solver to concurrently verify 
each query and pinpoints conflicting OCL invariants. Our evaluation re- 
sults suggest that QMaxUSE can offer up to 30x efficiency improvement 
in verifying UML class diagrams with a large number of OCL invariants. 


1 Introduction 


Verifying the consistency of a UML class diagram annotated with OCL con- 
straints is a challenging task [1,2,3]. This is because it requires finding an in- 
stance satisfying both structural and OCL constraints at the same time. To 
tackle this challenge, many tools and techniques have been proposed [4,5,6,7,8]. 
However, most of these tools and techniques do not scale well when the number 
of OCL invariants significantly increases [9,10,11,12,13,5,14,15,16]. These tools 
often time out or cannot pinpoint the conflicting OCL invariants that cause a 
UML class diagram to become inconsistent. 

In this paper, we present a new tool, QMaxUSE, that is capable of verifying a
large number of complex OCL invariants in an efficient manner. This is achieved
by two distinct features provided within QMaxUSE: (1) a query language that
allows users to select parts of a UML class diagram to be verified, and (2) a new
specialised algorithm that is able to decompose a UML class diagram that has a
large number of complex OCL invariants into different queries. These queries can
then be verified concurrently via efficient SMT solving. A detailed explanation
of our approach can be found in [17].

Related Work. Verifying the consistency of a UML class diagram has gained
much attention in recent years, and many approaches and tools have been proposed. A
UML class diagram can be considered as a graph, so graph-based approaches are
naturally employed for reasoning about consistency [18,19,20,7,21]. Semeráth
et al. proposed a new graph solver that is able to generate a much larger number of
objects [22]. Their approach utilises a combination of multiple advanced graph-based
and SAT-solving techniques to achieve large-scale graph generation. On
the other hand, many tools incorporate logic solvers to support OCL constraint
solving [14,16,23,24,25]. However, many of them do not scale well and cannot
pinpoint conflicting OCL constraints when a UML class diagram is inconsistent.
Our goal here is to provide an open-source tool that not only locates
conflicting OCL constraints but also preserves high performance when
the number of OCL constraints significantly increases.

© The Author(s) 2022
E. B. Johnsen and M. Wimmer (Eds.): FASE 2022, LNCS 13241, pp. 310-317, 2022.
https://doi.org/10.1007/978-3-030-99429-7_17


2 Architecture 


QMaxUSE is fully automatic and integrated with the USE modelling tool [26].
Currently, it is command-line based and runs on Windows 10 (x64),
Ubuntu 20.04 (x64) and macOS Big Sur (x64). QMaxUSE is
implemented in Java. It consists of nearly 33k lines of code, of which approximately
3.5k lines are dedicated to its algorithms. The latest version of QMaxUSE
is available at [27]:

https://github.com/classicwuhao/qmaxuse

The architecture of QMaxUSE is shown in Figure 1. Overall, it has four 
layers: front-end, query engine, translation and solver. 


[Figure 1 layers and components: Front-end (Model Parser, Query Parser); Query Engine (Selection Module, Decomposer); Translation (First-order Translator); Solver (Solver Manager, SMT Solver). Artifacts passed between the layers: Model AST, Query AST, Query Results.]

Fig. 1. The overall architecture of QMaxUSE.


Front-end. At the front-end layer, QMaxUSE uses parsers from USE to gen- 
erate ASTs (abstract syntax trees) for a class diagram and OCL invariants. 
QMaxUSE provides a simple query language that allows users to choose a part 
of a class diagram and its OCL invariants to be verified. To parse a query issued 
by a user, we have designed and implemented a query parser. This parser is 
able to read multiple queries simultaneously in a specification file and produces 
corresponding ASTs. 

Query Engine. QMaxUSE’s query engine uses a set of selection algorithms to 
traverse the ASTs generated from the front-end layer to produce a query result. 
A query result essentially contains a set of classes, attributes, associations and 
OCL invariants to be verified. At this layer, QMaxUSE also provides a specialised
algorithm (Decomposer) that is able to decompose a class diagram along with 
OCL invariants into a set of different queries. These queries can then be verified 
concurrently using a query verification procedure. 

Translation. At the translation layer, QMaxUSE uses a first-order translator 
to translate a query into a set of first-order formulas that can be verified by the 
SMT solver. The translation here is similar to the one described in [8]. We use 
uninterpreted functions to encode classes or attributes and linear integer inequal- 
ities to capture the multiplicities at an association-end. For an OCL invariant, 
we traverse its AST and generate an SMT formula by using a combination of 
first-order theories. 

Solver. We have designed a new interface (Solver Manager) to optimise the
interaction between QMaxUSE and the SMT solver. This interface reduces the
overhead between our first-order translator and an SMT solver by minimising
the number of API calls. Currently, QMaxUSE uses Z3 as its default
SMT solver, and this new interface allows us to easily plug in other SMT solvers
[28].


3 Design 


3.1 Query 


QMaxUSE allows a user to verify a particular set of features of a UML class
diagram through a query language. A query expression accepted by QMaxUSE
must use a select statement, which lets users choose multiple features along
with OCL invariants from a UML class diagram. A feature here may be
a class, an attribute, an association or an OCL invariant. For example, the
following query (query 1) selects the University, Department, Student
and Module classes and the association teach, along with invariant inv2 of the
Student class and the invariant defined under the Module class, from the UML
class diagram in Figure 2.

query 1: select University, Student.*, Department:teach:Student with Student::inv2, Module::*

Notably, we allow users to use the wildcard character * to represent the set of features
under a specified classifier. Further, it is quite common that an OCL invariant
uses features from other classes in its expression. Hence, our selection algorithm
implicitly selects these features during the execution of a query. Thus,
query 1 also selects the Person class from Figure 2, since inv2 defined under the
Student class imposes a constraint on the age attribute that is inherited from
the Person class.
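
To make the shape of a query concrete, the following sketch splits a select statement into its parts. It is not QMaxUSE's query parser (which is implemented in Java and produces ASTs); the grammar is inferred only from the two example queries in this section, and the function name `parse_select` and the result layout are assumptions.

```python
def parse_select(query: str) -> dict:
    """Split 'select <features> with <invariants>' into its components.

    Assumed feature forms: 'Class', 'Class.member' (member may be '*'),
    and 'Source:association:Target'. Invariants are 'Class::name'.
    """
    text = query.strip()
    if not text.startswith("select "):
        raise ValueError("a query must start with 'select'")
    features_text, _, invariants_text = text[len("select "):].partition(" with ")

    result = {"classes": [], "members": [], "associations": [], "invariants": []}
    for item in (part.strip() for part in features_text.split(",")):
        if not item:
            continue
        if ":" in item:
            source, name, target = item.split(":")
            result["associations"].append((source, name, target))
        elif "." in item:
            owner, member = item.split(".", 1)
            result["members"].append((owner, member))
        else:
            result["classes"].append(item)
    for item in (part.strip() for part in invariants_text.split(",")):
        if item:
            owner, name = item.split("::")
            result["invariants"].append((owner, name))
    return result


query1 = parse_select(
    "select University, Student.*, Department:teach:Student "
    "with Student::inv2, Module::*"
)
```

The sketch stops at the syntactic level. The implicit selection of features referenced inside an invariant, such as the inherited age attribute, requires the invariant ASTs and is not modelled here.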

For each query issued by a user, QMaxUSE launches a verification procedure
that verifies the consistency of the collected features. This verification
procedure casts the set of collected features to a set of SMT formulas that
can be checked by an SMT solver. If the formulas are not satisfiable, QMaxUSE
reports inconsistencies by pinpointing the OCL invariants that cause conflicts.
For example, QMaxUSE reports that there is a conflict between OCL invariants
inv1 and inv2 after verifying the following query (query 2). It shows that
inv1 and inv2 together make the Student class impossible to instantiate. Figure 3
shows a screenshot of QMaxUSE after executing query 2.

query 2: select Person.*, Student.* with Person::inv1, Student::inv2

[Class diagram omitted; association labels recoverable from the figure: manage, choose.]

context Person
inv1: Person.allInstances()->forAll(p|p.age>0 and p.age<18)

context Student
inv2: self.age>18
inv3: self.year>=1 and self.year<=6
inv4: Student.allInstances()->forAll(s1,s2:Student|s1<>s2 implies s1.id <> s2.id)
inv5: Student.allInstances()->forAll(s|s.modules->forAll(m|s.year=m.year))
inv6: Student.allInstances()->exists(s|s.year=6) and Student.allInstances()->exists(s|s.year<6)
inv7: Student.allInstances()->forAll(s|s.modules->notEmpty())

context Module
inv8: self.year>=1 and self.year<=5

Fig. 2. A UML class diagram with 8 OCL class invariants, showing how the students
in each department can choose multiple modules to study.
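
The conflict between inv1 and inv2 can be reproduced with a few lines of code. The sketch below replaces the SMT solver with a brute-force search over a bounded range of ages, which is an assumption made to keep the example free of dependencies; an SMT solver decides the same question over unbounded integers. The function names are hypothetical.

```python
def inv1(age: int) -> bool:
    """Person invariant: 0 < age < 18."""
    return 0 < age < 18


def inv2(age: int) -> bool:
    """Student invariant: age > 18 (Student inherits age from Person)."""
    return age > 18


AGES = range(-1000, 1001)  # bounded search space, an assumption of this sketch

# Each invariant is satisfiable on its own ...
inv1_alone = any(inv1(a) for a in AGES)
inv2_alone = any(inv2(a) for a in AGES)

# ... but no age satisfies both, so no Student instance can exist.
witnesses = [a for a in AGES if inv1(a) and inv2(a)]
```

Because every Student is also a Person, both invariants constrain the same age attribute, and the empty list of witnesses corresponds to the unsatisfiable formula set reported for query 2.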


3.2 Concurrent Verification 


QMaxUSE has a dedicated algorithm for performing concurrent
verification on UML class diagrams with a large number of OCL invariants. The
main idea of this algorithm is to decompose a large number of complex
OCL invariants into different queries. For each query, it launches a
verification thread to verify that query. In this way, QMaxUSE is able to
shift solving a large number of complex formulas from a single run into multiple
simultaneous runs on a collection of much smaller and less complex formulas.
Therefore, it is particularly powerful when the number of OCL invariants grows
significantly.

A high-level structure of this dedicated algorithm is shown in Algorithm 1
[17]. This algorithm takes a UML class diagram annotated with OCL invariants
(denoted as model) as its input and outputs a set S that contains all possible
conflicting features. It first employs a novel decomposition algorithm to decompose
a model into different parts and produces a query for each part of this
model. It then executes each query and produces a new query result by explicitly
choosing those features that are used by an OCL invariant expression in
a query. Once the set of query results is generated, Algorithm 1 launches a
number of threads to verify the formulas (Φ) that encode the query results. If
Φ is not satisfiable, then there must be conflicts. Finally, our
algorithm extracts those conflicting features and saves them into the set S.

Fig. 3. A screenshot of running query 2 in QMaxUSE.


Algorithm 1: Concurrent Verification

Input : A UML class diagram annotated with OCL invariants (model)
Output: A set of conflicting features that cause inconsistencies (S)

R ← ∅; S ← ∅;
Q ← Decompose(model);            /* produce a set of queries Q */
foreach q ∈ Q do
    q_r ← q.execute();           /* create a new query result q_r */
    /* add features used in an OCL invariant into the query result q_r */
    foreach inv ∈ q do
        q_r.add(inv.classes(), inv.attributes(), inv.associations(), inv);
    end
    R.add(q_r);
end
/* verify model with |R| threads */
foreach q_r ∈ R do
    Φ ← Translate(q_r);          /* cast q_r to SMT formulas */
    ThreadManager.start(QueryVerification(Φ, S));
    /* check satisfiability of Φ and save each conflict that occurs in Φ in the set S */
end
return S;
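
The second loop of Algorithm 1 can be sketched with a thread pool. This is a simplified illustration under several assumptions, not the QMaxUSE implementation: the decomposition is given as input, each query constrains a single integer variable, a bounded brute-force search stands in for the SMT solver, and an unsatisfiable query reports all of its invariants as conflicting. The names `verify` and `concurrent_verification` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

DOMAIN = range(-100, 101)  # bounded search space, an assumption of this sketch


def verify(query):
    """Return the invariant names of a query if no value satisfies all of them."""
    _name, invariants = query
    for value in DOMAIN:
        if all(predicate(value) for _, predicate in invariants):
            return set()  # satisfiable: no conflict in this query
    return {inv_name for inv_name, _ in invariants}


def concurrent_verification(queries):
    """Verify every query in its own worker thread and collect the conflicts."""
    conflicts = set()
    with ThreadPoolExecutor(max_workers=max(1, len(queries))) as pool:
        for result in pool.map(verify, queries):
            conflicts |= result
    return conflicts


# Two hypothetical queries, as a decomposition step might produce them.
queries = [
    ("age", [("inv1", lambda a: 0 < a < 18), ("inv2", lambda a: a > 18)]),
    ("year", [("inv3", lambda y: 1 <= y <= 6)]),
]
S = concurrent_verification(queries)
```

Collecting the results in the calling thread avoids concurrent writes to the shared set; Algorithm 1 instead passes S to each verification thread.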
364
1 0.087
0.241
4.604
159.378 7.151
TO 8.111
TO 114.41
TO 118.64
8.211
14.026
2.587
4.464
3.653

Table 1. Evaluation results. Invs = number of OCL invariants, Nodes = size of invariant
ASTs, Quant = number of quantifiers, Op = number of operators, TO = timeout (20 min),
MaxUSE = QMaxUSE without query and concurrent verification support.


4 Results 


We use a benchmark from [8] to show the size and complexity of the OCL
invariants that QMaxUSE can handle. This benchmark has two parts. Part A covers
only a small number of toy examples from [29], and Part B covers a wide
range of OCL language features, including nested quantifiers, collections,
logical/arithmetic operations and navigations. In particular, Part B contains a large
number of complex and conflicting OCL invariants. Table 1 summarises part
of our evaluation results for QMaxUSE.¹ The evaluation was carried out on an
Intel(R) Core(TM) machine that has six 2.8 GHz cores and 16 GB of memory. The
underlying SMT solver is Z3 (version 4.8.10). As can be seen,
QMaxUSE is able to handle a much larger number of OCL invariants. It
gains up to a 30x efficiency improvement in verifying a large number of complex
OCL invariants. For example, it takes 131.23 seconds to verify B3 in Group B
without using our query and concurrent techniques, while QMaxUSE
finishes its verification in just 4.6 seconds (a speedup of roughly 28x).


5 Conclusion 


In this paper, we have presented our latest verification tool, QMaxUSE. We
believe that QMaxUSE can add significant value to the modelling community for two
reasons. (1) Users are now able to use QMaxUSE to incrementally verify different
parts of their class diagrams by issuing different queries. (2) Our preliminary
evaluation results indicate that QMaxUSE scales well to a large number of
complex OCL invariants because of our concurrent verification algorithm.


¹ The complete benchmark is packaged with the QMaxUSE release files.
Abstract. Test-Comp 2022 is the 4th edition of the Competition on
Software Testing. Research competitions are a means to provide annual 
comparative evaluations. Test-Comp focusses on fully automatic software 
test generators for C programs. The results of the competition shall be 
reproducible and provide an overview of the current state of the art in the 
area of automatic test-generation. The competition was based on 4236 
test-generation tasks for C programs. Each test-generation task consisted 
of a program and a test specification (error coverage, branch coverage). 
Test-Comp 2022 had 12 participating test generators from 5 countries. 
· SV-Benchmarks · BenchExec · TestCov · CoVeriTeam


1 Introduction 


The Competition on Software Testing (Test-Comp, https://test-comp.sosy-lab.org,
[5, 6, 7, 9]) showcases the state of the art in the area of automatic software testing.
For the 4th time, the competition provides an overview of the results achieved
by implementations of the most recent ideas, concepts, and algorithms for fully
automatic test generation. This competition report describes the (updated) rules
and definitions, presents the competition results, and discusses some interesting
facts about the execution of the competition experiments. We use BenchExec [20]
to execute the benchmarks; the results are presented in tables and graphs
on the competition web site (https://test-comp.sosy-lab.org/2022/results) and are
available in the accompanying archives (see Table 3).


This report extends previous reports on Test-Comp [5, 6, 7, 9]. 
Reproduction packages are available on Zenodo (see Table 3). 
dirk.beyer@sosy-lab.org


© The Author(s) 2022 
E. B. Johnsen and M. Wimmer (Eds.): FASE 2022, LNCS 13241, pp. 321-335, 2022. 
https://doi.org/10.1007/978-3-030-99429-7_18


Competition Goals. In summary, the goals of Test-Comp are the following [6]:


• Establish standards for software test generation. This means, most prominently,
  to develop a standard for marking input values in programs, define
  an exchange format for test suites, agree on a specification language for
  test-coverage criteria, and define how to validate the resulting test suites.

• Establish a set of benchmarks for software testing in the community. This
  means to create and maintain a set of programs together with coverage
  criteria, and to make those publicly available for researchers to be used in
  performance comparisons when evaluating a new technique.

• Provide an overview of available tools for test-case generation and a snapshot
  of the state-of-the-art in software testing to the community. This means to
  compare, independently from particular paper projects and specific techniques,
  different test generators in terms of effectiveness and performance.

• Increase the visibility and credits that tool developers receive. This means
  to provide a forum for presentation of tools and discussion of the latest
  technologies, and to give the participants the opportunity to publish about
  the development work that they have done.

• Educate PhD students and other participants on how to set up performance
  experiments, package tools in a way that supports reproduction, and how to
  perform robust and accurate research experiments.

• Provide resources to development teams that do not have sufficient computing
  resources and give them the opportunity to obtain results from experiments
  on large benchmark sets.


Related Competitions. In the field of formal methods, competitions are 
respected as an important evaluation method and there are many competitions [3]. 
We refer to the report from Test-Comp 2020 [6] for a more detailed discussion 
and give here only the references to the most related competitions [3, 10, 41, 43]. 


2 Definitions, Formats, and Rules 


Organizational aspects such as the classification (automatic, off-site, reproducible,
jury, training) and the competition schedule are given in the initial competition
definition [5]. In the following, we repeat some important definitions that
are necessary to understand the results.


Test-Generation Task. A test-generation task is a pair of an input program (program
under test) and a test specification. A test-generation run is a non-interactive
execution of a test generator on a single test-generation task, in order to generate a
test suite according to the test specification. A test suite is a sequence of test cases,
given as a directory of files according to the format for exchangeable test-suites.¹


Execution of a Test Generator. Figure 1 illustrates the process of executing
one test generator on the benchmark suite. One test run for a test generator gets
as input (i) a program from the benchmark suite and (ii) a test specification
(cover bug, or cover branches), and returns as output a test suite (i.e., a set of
test cases). The test generator is contributed by a competition participant as
a software archive in ZIP format. The test runs are executed centrally by the
competition organizer. The test-suite validator takes as input the test suite from
the test generator and validates it by executing the program on all test cases:
for bug finding it checks if the bug is exposed and for coverage it reports the
coverage. We use the tool TestCov [19]² as test-suite validator.

[Figure 1 components: Program under Test and Test Specification are given to the Test Generator, which produces a Test Suite (Test Cases); the Test Validator takes the test suite and produces Coverage Statistics.]

Fig. 1: Flow of the Test-Comp execution for one test generator (taken from [6])

¹ https://gitlab.com/sosy-lab/software/test-format


Test Specification. The specification for testing a program is given to the
test generator as input file (either properties/coverage-error-call.prp or
properties/coverage-branches.prp for Test-Comp 2022).

The definition init(main()) is used to define the initial states of the program
under test by a call of function main (with no parameters). The definition FQL(f)
specifies that coverage definition f should be achieved. The FQL (FShell query
language [30]) coverage definition COVER EDGES(@DECISIONEDGE) means that all
branches should be covered (typically used to obtain a standard test suite for quality
assurance) and COVER EDGES(@CALL(foo)) means that a call (at least one) to
function foo should be covered (typically used for bug finding). A complete
specification looks like: COVER( init(main()), FQL(COVER EDGES(@DECISIONEDGE)) ).

Table 1 lists the two FQL formulas that are used in test specifications of
Test-Comp 2022; there was no change from 2020 (except that special function
__VERIFIER_error does not exist anymore).
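
The composition of a complete specification from the two definitions can be sketched as plain string construction. The helper names `build_spec` and `call_goal` are hypothetical, and the exact whitespace of the resulting line is an assumption; the property files shipped with the benchmark set remain the authoritative source.

```python
BRANCH_GOAL = "COVER EDGES(@DECISIONEDGE)"


def call_goal(function: str) -> str:
    """FQL goal: at least one call to the given function is covered."""
    return f"COVER EDGES(@CALL({function}))"


def build_spec(goal: str) -> str:
    """Wrap an FQL coverage goal with the initial-state definition."""
    return f"COVER( init(main()), FQL({goal}) )"


branch_spec = build_spec(BRANCH_GOAL)
error_spec = build_spec(call_goal("reach_error"))
```

The first string corresponds to the branch-coverage specification quoted above, the second to the bug-finding specification with reach_error as the target function.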


Task-Definition Format 2.0. Test-Comp 2022 again used the task-definition format
in version 2.0.


License and Qualification. The license of each participating test generator 
must allow its free use for reproduction of the competition results. Details on 
qualification criteria can be found in the competition report of Test-Comp 2019 [7]. 


² https://gitlab.com/sosy-lab/software/test-suite-validator
Table 1: Coverage specifications used in Test-Comp 2022 (similar to 2019-2021)

Formula                            Interpretation
COVER EDGES(@CALL(reach_error))    The test suite contains at least one test
                                   that executes function reach_error.
COVER EDGES(@DECISIONEDGE)         The test suite contains tests such that
                                   all branches of the program are executed.


3 Categories and Scoring Schema 


Benchmark Programs. The input programs were taken from the largest and
most diverse open-source repository of software-verification and test-generation
tasks, which is also used by SV-COMP [8]. As in 2020 and 2021, we selected
all programs for which the following properties were satisfied (see issue
on GitHub⁴ and report [7]):

1. compiles with gcc, if a harness for the special methods⁵ is provided,
2. should contain at least one call to a nondeterministic function,
3. does not rely on nondeterministic pointers,
4. does not have expected result ‘false’ for property ‘termination’, and
5. has expected result ‘false’ for property ‘unreach-call’ (only for category Error
   Coverage).



This selection yielded a total of 4236 test-generation tasks, namely 776 tasks 
for category Error Coverage and 3460 tasks for category Code Coverage. The 
test-generation tasks are partitioned into categories, which are listed in Ta- 
bles 6 and 7 and described in detail on the competition web site. Figure 2 
illustrates the category composition. 


Category Error-Coverage. The first category is to show the abilities to discover
bugs. The benchmark set consists of programs that contain a bug. Every run
will be started by a batch script, which produces for every tool and every
test-generation task one of the following scores: 1 point, if the validator succeeds in
executing the program under test on a generated test case that explores the bug
(i.e., the specified function was called), and 0 points, otherwise.


Category Branch-Coverage. The second category is to cover as many branches 
of the program as possible. The coverage criterion was chosen because many 
test generators support this standard criterion by default. Other coverage cri- 
teria can be reduced to branch coverage by transformation [29]. Every run 
will be started by a batch script, which produces for every tool and every 


3 https://github.com/sosy-lab/sv-benchmarks
4 https://github.com/sosy-lab/sv-benchmarks/pull/774
5 https://test-comp.sosy-lab.org/2022/rules.php
6 https://test-comp.sosy-lab.org/2022/benchmarks.php
Fig. 2: Category structure for Test-Comp 2022; compared to Test-Comp 2021,
sub-category ProductLines was added to both main categories Cover-Error and 
Cover-Branches 


test-generation task the coverage of branches of the program (as reported by 
TestCov [19]; a value between 0 and 1) that are executed for the generated 
test cases. The score is the returned coverage. 
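
The two scoring rules described above can be sketched in a few lines of Python. This is a minimal illustration and not the competition's batch script: the function names and the plain-number inputs are assumptions made for the example (in the competition, the coverage value is reported by TestCov).

```python
def score_error_coverage(bug_exposed: bool) -> int:
    """Cover-Error: 1 point if the validator confirms that a generated
    test case reaches the specified function, 0 points otherwise."""
    return 1 if bug_exposed else 0


def score_branch_coverage(covered_branches: int, total_branches: int) -> float:
    """Cover-Branches: the score is the branch coverage, a value in [0, 1]."""
    if total_branches == 0:
        return 0.0  # assumption: a program without branches earns no score
    return covered_branches / total_branches


def category_score(task_scores):
    """The score of a tool in a category is the sum of its per-task scores."""
    return sum(task_scores)
```

For instance, a tool that exposes the bug in two of three Cover-Error tasks would collect `category_score([1, 0, 1])`, i.e., 2 points.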


Ranking. The ranking was decided based on the sum of points (normalized for 
meta categories). In case of a tie, the ranking was decided based on the run time, 


326 Dirk Beyer 


[Figure 3 components: (a) Test-Generation Tasks, (b) Benchmark Definitions, (c) Tool-Info Modules, (d) Tester Archives, (e) Test-Generation Run, (f) Test Suite]


Fig. 3: Benchmarking components of Test-Comp and competition’s execution flow 
(same as for Test-Comp 2020) 


Table 2: Publicly available components for reproducing Test-Comp 2022 


Component                Fig. 3  Repository                                      Version

Test-Generation Tasks    (a)     gitlab.com/sosy-lab/benchmarking/sv-benchmarks  testcomp22
Benchmark Definitions    (b)     gitlab.com/sosy-lab/test-comp/bench-defs        testcomp22
Tool-Info Modules        (c)     github.com/sosy-lab/benchexec                   3.10
Test-Generator Archives  (d)     gitlab.com/sosy-lab/test-comp/archives-2022     testcomp22
Benchmarking             (e)     github.com/sosy-lab/benchexec                   3.10
Test-Suite Format        (f)     gitlab.com/sosy-lab/software/test-format        testcomp22


which is the total CPU time over all test-generation tasks. Opt-out from categories 
was possible and scores for categories were normalized based on the number of 
tasks per category (see competition report of SV-COMP 2013 [4], page 597). 
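
The ranking rule can be illustrated with a short Python sketch. The normalization formula used here is an assumption about the scheme referenced above (divide each category score by the number of tasks in that category, sum the results, and multiply by the average number of tasks per category); the function names are hypothetical. Because real meta categories are nested, published scores can differ slightly from this flat computation.

```python
def normalized_meta_score(categories):
    """categories: list of (score, number_of_tasks) pairs, one per category.

    Assumed scheme: normalize each score by the category size, sum the
    normalized scores, and scale by the average number of tasks per category.
    """
    normalized_sum = sum(score / tasks for score, tasks in categories)
    average_tasks = sum(tasks for _, tasks in categories) / len(categories)
    return normalized_sum * average_tasks


def rank(tools):
    """tools: dict name -> (score, total_cpu_time).

    Higher score ranks first; ties are broken by lower total CPU time.
    """
    return sorted(tools, key=lambda name: (-tools[name][0], tools[name][1]))


# Category sizes from this report: 776 (Cover-Error), 3460 (Cover-Branches).
# Scores 623 and 2075 are the VeriFuzz results reported in this competition.
verifuzz_overall = normalized_meta_score([(623, 776), (2075, 3460)])
print(round(verifuzz_overall))  # close to the reported Overall score of 2971
```

The tie-break in `rank` only matters when two tools reach exactly the same score, in which case the one with the lower total CPU time is listed first.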


4 Reproducibility 


We followed the same competition workflow that was described in detail in 
the previous competition report (see Sect. 4, [9]). All major components that 
were used for the competition were made available in public version-control 
repositories. An overview of the components that contribute to the reproducible 
setup of Test-Comp is provided in Fig. 3, and the details are given in Table 2. 
We refer to the report of Test-Comp 2019 [7] for a thorough description of all 
components of the Test-Comp organization and how we ensure that all parts 
are publicly available for maximal reproducibility. 

In order to guarantee long-term availability and immutability of the test- 
generation tasks, the produced competition results, and the produced test suites, 
we also packaged the material and published it at Zenodo (see Table 3). 

The competition used CoVeriTeam [17]⁷ again to provide participants
access to the actual competition machines. The competition report of
SV-COMP 2022 provides a description of how to reproduce individual results and of
trouble-shooting (see Sect. 3, [10]).


7 https://gitlab.com/sosy-lab/software/coveriteam
Table 3: Artifacts published for Test-Comp 2022


Content                  DOI                     Reference

Test-Generation Tasks    10.5281/zenodo.5831003  [12]
Competition Results      10.5281/zenodo.5831012  [11]
Test-Suite Generators    10.5281/zenodo.5959598  [13]
Test Suites (Witnesses)  10.5281/zenodo.5831010  [14]
BenchExec                10.5281/zenodo.5720267  [47]


Table 4: Competition candidates with tool references and representing jury members;
‘new’ indicates first-time participants, ∅ indicates hors-concours participation

Tester              Ref.     Jury member             Affiliation

CMA-ES Fuzz∅        [84]     (hors concours)         -
CoVeriTest          [16,33]  Marie-Christine Jakobs  TU Darmstadt, Germany
FuSeBMC             [1,2]    Kaled Alshmrany         U. of Manchester, UK
HybridTiger∅        [22,42]  (hors concours)         -
KLEE∅               [23,24]  (hors concours)         -
Legion              [38,39]  Gidon Ernst             LMU Munich, Germany
Legion/SymCC (new)  [39]     Gidon Ernst             LMU Munich, Germany
LibKluzzer          [36]     Hoang M. Le             U. of Bremen, Germany
PRTest              [18,37]  Thomas Lemberger        LMU Munich, Germany
Symbiotic           [25,26]  Marek Chalupa           Masaryk U., Brno, Czechia
TracerX             [31,32]  Joxan Jaffar            National U., Singapore
VeriFuzz            [40]     Raveendra Kumar M.      Tata Consultancy Services, India


5 Results and Discussion 


This section presents the results of the competition experiments. The report
is intended to help in understanding the state of the art and the advances in fully
automatic test generation for whole C programs, in terms of effectiveness (test
coverage, as accumulated in the score) and efficiency (resource consumption
in terms of CPU time). All results mentioned in this article were inspected
and approved by the participants.


Participating Test Generators. Table 4 provides an overview of the participating
test generators and references to publications, as well as the team
representatives of the jury of Test-Comp 2022. (The competition jury consists of
the chair and one member of each participating team.) An online table with information
about all participating systems is provided on the competition web site.⁸
Table 5 lists the features and technologies that are used in the test generators.

There are test generators that did not actively participate (e.g., tester archives
taken from last year) and that are not included in rankings. Those are called
hors-concours participations and the tool names are labeled with a symbol (∅).


8 https://test-comp.sosy-lab.org/2022/systems.php
Table 5: Technologies and features that the test generators used

[Table 5 body: check-mark matrix with one row per test generator (CMA-ES Fuzz∅, CoVeriTest, FuSeBMC, HybridTiger∅, KLEE∅, Legion, Legion/SymCC (new), LibKluzzer, PRTest, Symbiotic, TracerX, VeriFuzz)]


Computing Resources. The computing environment and the resource limits
were the same as for Test-Comp 2020 [6]: Each test run was limited to
8 processing units (cores), 15 GB of memory, and 15 min of CPU time. The
test-suite validation was limited to 2 processing units, 7 GB of memory, and
5 min of CPU time. The machines for running the experiments are part of a
compute cluster that consists of 167 machines; each test-generation run was
executed on an otherwise completely unloaded, dedicated machine, in order to
achieve precise measurements. Each machine had one Intel Xeon E3-1230 v5
CPU with 8 processing units, a frequency of 3.4 GHz, 33 GB of RAM,
and a GNU/Linux operating system (x86_64-linux, Ubuntu 20.04 with Linux
kernel 5.4). We used BenchExec [20] to measure and control computing resources
(CPU time, memory, CPU energy) and VerifierCloud⁹ to distribute, install,


9 https://vcloud.sosy-lab.org
Table 6: Quantitative overview over all results; empty cells mark opt-outs; ‘new’ indicates
first-time participants, ∅ indicates hors-concours participation

Tester              Cover-Error  Cover-Branches  Overall
                    776 tasks    3460 tasks      4236 tasks

CMA-ES Fuzz∅        0            624             382
CoVeriTest          423          1860            2293
FuSeBMC             628          2104            3003
HybridTiger∅        355          1406            1830
KLEE∅               500          1242            2125
Legion              57           1033            787
Legion/SymCC (new)               1487
LibKluzzer          528          1990            2658
PRTest              145          896             945
Symbiotic           463          1802            2367
TracerX             0            1746            1069
VeriFuzz            623          2075            2971


run, and clean up test-case generation runs, and to collect the results. The values
for time and energy are accumulated over all cores of the CPU. To measure the
CPU energy, we use CPU Energy Meter [21] (integrated in BenchExec [20]).
Further technical parameters of the competition machines are available in the
repository that also contains the benchmark definitions.¹⁰

One complete test-generation execution of the competition consisted of 
50 056 single test-generation runs. The total CPU time was 339 days and the 
consumed energy 88 kWh for one complete competition run for test generation 
(without validation). Test-suite validation consisted of 50 832 single test-suite 
validation runs. The total consumed CPU time was 15 days. Each tool was 
executed several times, in order to make sure no installation issues occur dur- 
ing the execution. Including preruns, the infrastructure managed a total of 
311754 test-generation runs (consuming 4.9 years of CPU time). The CPU 
energy was not measured during preruns. 
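
As a small worked calculation, the totals above can be turned into average CPU time per run. The sketch below only re-derives numbers from the figures stated in this section; the function name is an assumption made for the example.

```python
MINUTES_PER_DAY = 24 * 60


def average_cpu_minutes(total_cpu_days: float, runs: int) -> float:
    """Average CPU time per run, in minutes."""
    return total_cpu_days * MINUTES_PER_DAY / runs


# Totals as stated in the text.
generation = average_cpu_minutes(339, 50_056)  # test-generation runs
validation = average_cpu_minutes(15, 50_832)   # test-suite validation runs
print(f"{generation:.1f} min per test-generation run")
print(f"{validation:.2f} min per validation run")
```

On average, a test-generation run thus used roughly two thirds of its 15 min CPU-time limit, while validation runs stayed well below their 5 min limit.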


Quantitative Results. The quantitative results are presented in the same 
way as last year: Table 6 presents the quantitative overview of all tools and all 
categories. The head row mentions the category and the number of test-generation 


10 https://gitlab.com/sosy-lab/test-comp/bench-defs/tree/testcomp22

Table 7: Overview of the top-three test generators for each category (measurement
values for CPU time and energy rounded to two significant digits)

Rank  Tester      Score  CPU Time  CPU Energy
                         (in h)    (in kWh)

Cover-Error
1     FuSeBMC     628    22        0.28
2     VeriFuzz    623    3.5       0.039
3     LibKluzzer  528    140       1.5

Cover-Branches
1     FuSeBMC     2104   850       11
2     VeriFuzz    2075   850       11
3     LibKluzzer  1990   760       8.3

Overall
1     FuSeBMC     3003   870       11
2     VeriFuzz    2971   860       11
3     LibKluzzer  2658   900       9.8


tasks in that category. The tools are listed in alphabetical order; every table
row lists the scores of one test generator. We indicate the top three candidates
by formatting their scores in bold face and in larger font size. An empty table
cell means that the test generator opted out from the respective main category
(perhaps participating in subcategories only, restricting the evaluation to a specific
topic). More information (including interactive tables, quantile plots for every
category, and also the raw data in XML format) is available on the competition
web site¹¹ and in the results artifact (see Table 3). Table 7 reports the top three
test generators for each category. The consumed run time (column ‘CPU Time’)
is given in hours and the consumed energy (column ‘CPU Energy’) is given in kWh.


Score-Based Quantile Functions for Quality Assessment. We use score-based
quantile functions [20] because these visualizations make it easier to
understand the results of the comparative evaluation. The web site¹¹ and the
results artifact (Table 3) include such a plot for each category; as an example, we
show the plot for category Overall (all test-generation tasks) in Fig. 4. We had
11 test generators participating in category Overall, for which the quantile plot
shows the overall performance over all categories (scores for meta categories
are normalized [4]). A more detailed discussion of score-based quantile plots for
testing is provided in the Test-Comp 2019 competition report [7].


Alternative Rankings. Table 8 is similar to Table 7, but contains the al- 
ternative ranking category Green Testing. Column ‘Quality’ gives the score in 
score points (sp), column ‘CPU Time’ the CPU usage in hours (h), column 


11 https://test-comp.sosy-lab.org/2022/results

[Figure 4: quantile plot with one graph per test generator; x-axis: cumulative score; y-axis: min. number of test tasks]

Fig. 4: Quantile functions for category Overall. Each quantile function illustrates
the quantile (x-coordinate) of the scores obtained by test-generation runs below
a certain number of test-generation tasks (y-coordinate). More details were given
previously [7]. The graphs are decorated with symbols to make them better
distinguishable without color.


Table 8: Alternative rankings; quality is given in score points (sp), CPU time 
in hours (h), energy in kilo-watt-hours (kWh), the first rank measure in kilo- 
joule per score point (kJ/sp), and the second rank measure in score points (sp); 
measurement values are rounded to 2 significant digits 


Rank   Test Generator  Quality  CPU Time  CPU Energy  Rank Measure
                       (sp)     (h)       (kWh)

Green Testing                                         (kJ/sp)
1      TracerX         1069     120       1.4         4.8
2      KLEE∅           2125     310       3.5         6.0
3      Symbiotic       2367     540       5.9         9.0
worst                                                 41


‘CPU Energy’ the CPU usage in kilo-watt-hours (kWh), and column ‘Rank 
Measure’ reports the values for the rank measure. 


Green Testing — Low Energy Consumption. Since a large part of the cost of 
test generation is caused by the energy consumption, it might be important to 
also consider the energy efficiency in rankings, as a complement to the official
Test-Comp ranking. This alternative ranking category uses the energy consumption
per score point as rank measure: $\frac{\text{CPU Energy}}{\text{Quality}}$, with the unit kilo-joule per
score point (kJ/sp). The energy is measured using CPU Energy Meter [21],
which we use as part of BenchExec [20].
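
The rank measure can be recomputed from the rounded values printed in Table 8. The following sketch assumes only the unit conversion 1 kWh = 3600 kJ; the function name is hypothetical. Because the table values are rounded to two significant digits, recomputed results can deviate slightly from the printed rank measure.

```python
KJ_PER_KWH = 3600  # 1 kWh = 3600 kJ


def green_rank_measure(cpu_energy_kwh: float, quality_sp: float) -> float:
    """Energy per score point in kJ/sp (lower is better)."""
    return cpu_energy_kwh * KJ_PER_KWH / quality_sp


# Rounded values as printed in Table 8 for Symbiotic: 5.9 kWh and 2367 sp.
print(round(green_rank_measure(5.9, 2367), 1))  # 9.0, matching Table 8
```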


New Test Generators. To acknowledge the test generators that participated
for the first time in Test-Comp, we list them in Table 9.
CMA-ES Fuzz∅ and FuSeBMC participated for the first time in
Table 9: New test generators in Test-Comp 2021 and Test-Comp 2022; column
‘Sub-categories’ gives the number of executed categories


Test Generator      Language  First Year  Sub-categories
Legion/SymCC (new)  C         2022        16
CMA-ES Fuzz∅        C         2021        30
FuSeBMC             C         2021        30


[Figure 5: stacked bar chart; x-axis: Year (2019, 2020, 2021, 2022); y-axis: number of evaluated test generators]
Fig. 5: Number of evaluated test generators for each year (top: number of first- 
time participants; bottom: previous year’s participants) 


Test-Comp 2021, and Legion/SymCC (new) participated for the first time in Test-Comp 2022.
Table 9 also reports the number of subcategories in which the tools participated.


6 Conclusion 


For the 4th time, the Competition on Software Testing took place and provided
an overview of test-generation tools for C programs. The competition event
attracted 12 participating teams (see Fig. 5 for the participation numbers and
Table 4 for the details). The competition is an off-site competition, and the execution
of the experiments is fully automatic and reproducible. To ensure transparency,
all components are made available in public repositories and a jury (consisting
of members from each team) oversees the process. The produced test suites are
validated by the test-suite validator TestCov. The results of the competition
are presented at the 25th International Conference on Fundamental Approaches
to Software Engineering at ETAPS 2022.


Data-Availability Statement. The test-generation tasks and results of the com- 
petition are published at Zenodo, as described in Table 3. All components and data 
that are necessary for reproducing the competition are available in public version 
repositories, as specified in Table 2. For easy access, the results are also presented
online on the competition web site https://test-comp.sosy-lab.org/2022/results.


Funding Statement. This project was funded in part by the Deutsche Forschungs- 
gemeinschaft (DFG) — 418257054 (Coop). 
Abstract. FuSeBMC is a test generator for finding security vulnerabilities in C
programs. In Test-Comp 2021, we described a previous version that incremen- 
tally injected labels to guide Bounded Model Checking (BMC) and Evolutionary 
Fuzzing engines to produce test cases for code coverage and bug finding. This 
paper introduces an improved version of FuSeBMC that utilizes both engines to 
produce smart seeds. First, the engines run with a short time limit on a lightly 
instrumented version of the program to produce the seeds. The BMC engine is 
particularly useful in producing seeds that can pass through complex mathemati- 
cal guards. Then, FuSeBMC runs its engines with extended time limits using the 
smart seeds created in the previous round. FuSeBMC manages this process in two 
main ways. Firstly, it uses shared memory to record the labels covered by each 
test case. Secondly, it evaluates test cases, and those of high impact are turned into 
seeds for subsequent test fuzzing. In this year’s competition, we participate in the 
Cover-Error, Cover-Branches, and Overall categories. The Test-Comp 2022 re- 
sults show that we significantly increased our code coverage score from last year, 
outperforming all tools in all categories. 


Keywords: Automated Test-Case Generation · Symbolic Execution · Bounded
Model Checking · Fuzzing · Security · Seed.


1 Overview 


Software testing is one of the most crucial phases in software development [11]. Tests 
often expose critical bugs in software applications. In earlier work [4], we presented 
FuSeBMC, an automated test generation tool that exploits the combination of Fuzzing 
and BMC. FuSeBMC achieved second place in Test-Comp 2021 [5,3] and first place in 
the Cover-Error category. It ranked fourth in the Cover-Branches category. This year, 
we introduce a new version of FuSeBMC (v4) that adds smart seed generation and 
shared memory amongst other improvements and features. The new version signifi- 
cantly improves on the previous version, particularly relating to code coverage. One 
of the primary contributions of this paper is the linking of a grey-box fuzzer with a 
bounded model checker. A bounded model checker treats a program as a
state transition system and checks whether there exists a sequence of transitions in this
system of length less than a bound k that violates the property to be verified [6,8]. We leverage


“Jury Member 


© The Author(s) 2022 
E. B. Johnsen and M. Wimmer (Eds.): FASE 2022, LNCS 13241, pp. 336-340, 2022. 
https://doi.org/10.1007/978-3-030-99429-7_19


FuSeBMC v.4: Smart Seed Generation for Hybrid Fuzzing 337 


this power of model checkers as a method for smart seed generation. Here, we rate seeds 
on two metrics. First, the depth of the deepest goal covered by the seed. Second, the 
number of goals covered uniquely by the seed. Seeds that rate highly on these metrics 
are called smart. During grey-box fuzzing, if a particular branch has not been explored, 
BMC can be used to provide a model (set of assignments to input variables) that reaches 
the branch. This model is a smart seed since it covers a previously unexplored branch. 
It is then added to a seed store. Periodically, seeds are selected from the store for further
grey-box fuzzing based on the criteria mentioned above. However, BMC can be slow
and resource-intensive. As an alternative, we also carry out a lightweight static program 
analysis to recognize certain restricted forms of input verification. We analyze the code 
for conditions on the input variables and ensure that seeds are only selected if they pass 
these conditions. Together, these contributions turn FuSeBMC into a world-class fuzzer. 
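
The bounded model checking idea mentioned above can be illustrated with a toy explicit-state search in Python. This is only a sketch under simplifying assumptions: real bounded model checkers such as ESBMC encode the unrolled program as an SMT formula instead of enumerating paths, the bound here is "at most k transitions", and all names are hypothetical. The returned path plays the role of the model that would become a seed.

```python
def bounded_check(initial_states, successors, violates, k):
    """Return a path of at most k transitions to a violating state, or None.

    Explicit-state stand-in for BMC: explores all paths up to the bound k.
    """
    frontier = [[s] for s in initial_states]
    for _ in range(k + 1):
        next_frontier = []
        for path in frontier:
            state = path[-1]
            if violates(state):
                return path
            # Extend the current path by one transition in every direction.
            next_frontier.extend(path + [t] for t in successors(state))
        frontier = next_frontier
    return None


# Toy system: a counter that can add 1 or 3; the "bug" is reaching the value 7.
path = bounded_check([0], lambda x: [x + 1, x + 3], lambda x: x == 7, k=3)
print(path)  # [0, 1, 4, 7]
```

With a bound of `k=2` the same call returns `None`, since no value of 7 is reachable within two transitions; this mirrors why the choice of bound matters for what BMC can find.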


2 Test Generation Approach 


Figure 1 provides an overview of the components within FuSeBMC and how these interact.
FuSeBMC makes use of the Clang tooling infrastructure [1] to instrument programs.
In addition, FuSeBMC employs three engines in its reachability analysis: one BMC and
two fuzzing engines. ESBMC [9,10] is a state-of-the-art SMT-based bounded model
checker. Of the two fuzzers, one is based on American Fuzzy Lop (AFL) [7,2],
and the other is a custom fuzzer, which we refer to here as the selective fuzzer (see [4] for
details). In the sections below, we detail how these components work together.


[Figure 1 components: Analysis and Injection (analyze and inject goals, goal graph), Seed Generation (BMC/AFL), Test-Generation (AFL, selective fuzzer, BMC), Tracer, Shared Memory (goals covered array, seeds)]


Fig. 1. FuSeBMC v4 Framework. This figure illustrates the major components of the FuSeBMC
test generator and how they interact. Note in particular the seed store, which interacts with the
BMC/AFL and the shared memory to produce test cases.


Code Instrumentation. The FuSeBMC front-end uses the Clang tooling infrastructure [1] to
parse a C program and produce an Abstract Syntax Tree (AST). While traversing the
AST, FuSeBMC injects labels into each branch, including every conditional statement,
loop, and function. Using these labels, FuSeBMC can measure the code coverage.


Reachability Graph Analysis. After instrumenting the C program, FuSeBMC analyzes
it and produces a reachability graph. The graph assigns each goal label to the code block
it is located in. Then, FuSeBMC ranks goals depending on the strategy chosen. For
example, one strategy, which we used in Test-Comp 2022, is to prefer deeper goals over
shallower goals. This strategy improves the performance of FuSeBMC since a test case
that covers a deep goal will also cover shallower goals on the path to it. FuSeBMC also
ranks some coverage metrics over others, such as conditional coverage over loop coverage.


Seed Generation. A unique aspect of the latest version of FuSeBMC is a seed generation
phase that is run prior to the start of the principal reachability analysis. In this phase,
FuSeBMC first lightly instruments the code under test by limiting loop bounds and assuming
a narrow range of values for input variables. The bounds on input variables are
further limited by carrying out a lightweight static analysis to recognize code that applies
verification conditions to input variables. After instrumenting the code, FuSeBMC
runs its fuzzing and BMC engines with short time limits (60 s for Test-Comp 2022).
The test cases generated by these engines are ranked, and the highest-impact test cases
are selected as smart seeds for the next round. The selected seeds are added to the seed
store. The impact of a test case is measured using two metrics:

1. The number of labels covered uniquely by that test case.

2. The maximum program depth achieved by the test case.

ESBMC is particularly effective at seed generation as its underlying SMT solvers can be
used to discover test cases that circumvent complex mathematical guards. Note that we
do not rely on any specific features of the models returned by the SMT solvers. Instead,
the strength of the method lies in the solvers’ ability to return some model that can
satisfy a guard and cover goals lying beyond. A fuzzer on its own, randomly mutating
a seed, struggles to explore program sections occurring behind complex guards [12].
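
The two impact metrics can be made concrete with a small Python sketch. It is not FuSeBMC's implementation: the data layout, the function name, and in particular the way the two metrics are combined (unique coverage first, depth as tie-breaker) are assumptions made for illustration.

```python
from collections import Counter


def rank_seeds(test_cases, goal_depth, keep=2):
    """Select the `keep` highest-impact test cases as seeds.

    test_cases: dict name -> set of covered goal labels.
    goal_depth: dict label -> depth of the goal in the program.
    """
    # Count how many test cases cover each label to detect unique coverage.
    cover_count = Counter(
        label for labels in test_cases.values() for label in labels
    )

    def impact(name):
        labels = test_cases[name]
        unique = sum(1 for label in labels if cover_count[label] == 1)
        deepest = max((goal_depth[label] for label in labels), default=0)
        # Assumption: metric 1 decides, metric 2 breaks ties.
        return (unique, deepest)

    return sorted(test_cases, key=impact, reverse=True)[:keep]


tests = {"t1": {"L1", "L2"}, "t2": {"L1"}, "t3": {"L1", "L3", "L4"}}
depths = {"L1": 1, "L2": 2, "L3": 3, "L4": 4}
print(rank_seeds(tests, depths))  # ['t3', 't1']
```

In this toy input, `t3` ranks first because it uniquely covers two labels and reaches depth 4, whereas `t2` covers nothing that the other test cases do not already cover.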


Reachability Analysis Engines. In its primary phase, FuSeBMC carries out reachability
analysis. Essentially, this involves running the engines in parallel with longer
timeouts on the original, non-instrumented code, with the fuzzer making use of the smart
seeds. ESBMC is run using an incremental BMC strategy with some fixed time limit for
each goal it attempts. FuSeBMC’s Tracer component coordinates the various engines
through the use of shared memory. This shared memory has two components.
The first component is a “goals covered array” that stores the goals covered so far during
the execution. Its purpose is to ensure that no effort is wasted through duplication
of work. The second component is a set of the currently most effective seeds, which the
Tracer maintains for the fuzzer to use.

As the engines run and produce new test cases, the Tracer monitors these and eval- 
uates them, adding those with the highest impact, as measured by the metrics above, to 
the seed store. Thus, the seed store is dynamically updated as the analysis progresses. 
Periodically, it selects a number of the most effective seeds from the store and adds 
them to shared memory for the fuzzers to use in their next fuzzing round. In parallel, 
ESBMC uses the “goals covered array” to select an as yet uncovered goal and attempts 
to find a test case that covers it. Test cases produced by ESBMC are passed directly to 
the store because they are likely to be beneficial for future fuzzing attempts. 

For example, assume that the fuzzers are unable to cover some goal L due to a 
complex condition guarding it. ESBMC can be used to create a seed that covers L. This 
seed is then passed to the store and later selected for fuzzing. The fuzzers, armed with a 
seed that covers L, may well now be able to reach goals deeper than L along L’s path. 
Thus, FuSeBMC combines the strengths of both types of engines. The BMC engine 
produces seeds that bypass complex guards and thereby help the fuzzers explore paths 
deep within the program. 
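
The coordination through the “goals covered array” can be sketched as follows. This is a single-process stand-in written for illustration, not FuSeBMC's Tracer: real shared memory, parallel engines, and the depth-based goal ranking are omitted, and the class and variable names are hypothetical.

```python
class GoalsCovered:
    """Single-process stand-in for the shared 'goals covered array'."""

    def __init__(self, number_of_goals):
        self.flags = bytearray(number_of_goals)  # 0 = uncovered, 1 = covered

    def mark(self, goals):
        """Record goals hit by a test case; return those that are new."""
        new = [g for g in goals if not self.flags[g]]
        for g in new:
            self.flags[g] = 1
        return new

    def next_uncovered(self):
        """Pick a goal for the BMC engine, or None if all are covered.

        Simplification: takes the first uncovered goal instead of
        applying a goal-ranking strategy.
        """
        for goal, is_covered in enumerate(self.flags):
            if not is_covered:
                return goal
        return None


covered = GoalsCovered(4)
covered.mark([0, 1])                # a fuzzer test case covers goals 0 and 1
target = covered.next_uncovered()   # the BMC engine would now attempt goal 2
covered.mark([target, 3])           # pretend the BMC test case covers goals 2 and 3
seed_store = ["bmc-test-case"]      # BMC test cases are passed directly to the store
print(target, covered.next_uncovered(), seed_store)
```

Because every engine consults the same array before choosing work, a goal that the fuzzers already reached is never handed to the BMC engine again.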
3 Strengths and Weaknesses


The strengths of the latest version of FuSeBMC are as follows. It runs a dedicated seed 
generation phase to start the main fuzzing effort with high-quality, high-impact seeds. 
Furthermore, these seeds are constantly being updated during the main test-generation 
phase. Beyond this, it incorporates a dedicated subsystem, the Tracer, that uses a shared 
memory store to manage the various engines. By combining the engines, the Tracer
ensures that FuSeBMC far outperforms the individual engines, and even the engines
running in parallel but in isolation. The outcome of these improvements can be seen
in the ECA and Combination benchmark sets. Previously, these posed a challenge to 
FuSeBMC. With the latest changes, FuSeBMC achieved first place in the Combination 
subcategory and took second place in the ECA subcategory of the 2022 Test-Comp 
competition. Since the benchmarks in the ECA category have remained stable between 
last year’s and this year’s competitions, we can measure FuSeBMC’s improvement in 
terms of the combined coverage it achieves across the 29 tasks. This improvement 
stands at a remarkable 60%. The 2022 Test-Comp results also show that FuSeBMC 
has achieved first place in the Cover-Branches category with high coverage and valida- 
tion statistics. However, one of the weaknesses of FuSeBMC that we plan to work on is 
that for large programs, particularly for programs that redefine C library functions, seed 
generation can be slow and consume too much of the tool’s time. 


4 Tool Setup and Configuration 


FuSeBMC can be run using the command below. The user is required to set the archi- 
tecture, the property file path, the competition strategy, and the benchmark path, as: 


fusebmc.py [-a {32, 64}] [-p PROPERTY_FILE] 
[-s {kinduction, falsi, incr, fixed}] 
[BENCHMARK_PATH] 


where -a sets the architecture to 32 or 64, and -p sets the property file to PROPERTY_FILE,
which lists all the properties to be tested. -s sets the BMC strategy
to one of the listed strategies {kinduction, falsi, incr, fixed}. For Test-Comp’22,
FuSeBMC uses incr for incremental BMC, which relies on ESBMC’s
symbolic execution engine to increasingly unwind the program loops using an iterative
technique. The incr strategy verifies the program for each unwind bound up to a maximum
default value of 50 or indefinitely (until it exhausts the time or memory limits).
The BenchExec tool-info module is fusebmc.py and the benchmark definition file is
FuSeBMC.xml.
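
To show how a command line of this shape is interpreted, the following Python sketch rebuilds the usage line with argparse. It is not FuSeBMC's source code: the default values, the help texts, and the example file names are assumptions made for illustration; only the option names and their allowed values are taken from the usage shown above.

```python
import argparse


def build_parser():
    """Parser mirroring the usage line above (defaults are assumptions)."""
    parser = argparse.ArgumentParser(prog="fusebmc.py")
    parser.add_argument("-a", type=int, choices=[32, 64], default=32,
                        help="architecture")
    parser.add_argument("-p", metavar="PROPERTY_FILE",
                        help="path to the property file")
    parser.add_argument("-s", choices=["kinduction", "falsi", "incr", "fixed"],
                        default="incr", help="BMC strategy")
    parser.add_argument("benchmark", metavar="BENCHMARK_PATH", nargs="?",
                        help="path to the C program under test")
    return parser


# Hypothetical invocation with example file names.
args = build_parser().parse_args(
    ["-a", "64", "-p", "coverage-branches.prp", "-s", "incr", "program.c"]
)
print(args.a, args.p, args.s, args.benchmark)
```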


5 Software Project 


FuSeBMC is implemented using C++, and it is publicly available under the terms of the
MIT License at GitHub¹. The repository includes the latest version of FuSeBMC (version
4.1.14). FuSeBMC dependencies and instructions for building from source code
are all listed in the README.md file. Test-Comp 2022 provides the script, benchmarks,
and FuSeBMC binary to reproduce the competition’s results².


1 https://github.com/kaled-alshmrany/FuSeBMC
2 https://test-comp.sosy-lab.org/2022/
Abstract. We present VeriFuzz 1.2 with two new enhancements: (1)
unroll the given program to a short depth and use BMC to produce 
incomplete test inputs, which are extended into complete inputs, and (2) 
if BMC fails for this short unrolling, automatically identify the reason 
and rerun BMC with a corresponding remedial strategy. 


Keywords: Coverage Guided Fuzzing · Bounded Model Checking · Scalable
Model Checking


1 Introduction 


VeriFuzz 1.0 [5] is an automated test input generation tool built on top of 
AFL [11], a Coverage Guided Fuzzing (CGF) engine, and the PRISM [7] pro- 
gram analysis framework. CGF requires initial test inputs (seeds) to generate 
newer inputs in order to build a test suite for coverage. In VeriFuzz 1.0, the
seeds are generated as follows: (1) random seeds are either generated dynamically
or picked from a small set of unbiased inputs, and (2) for sequentialized
concurrent programs that have deep nesting, test inputs are generated using BMC
by unrolling all the loop bodies once. However, a direct application of CGF or
BMC to a given program may not yield the required coverage, as CGF may get stuck
in “complex conditions” [10], while BMC does not scale well for programs with
“complex loops”. To address these issues, we implemented two key enhancements
to VeriFuzz 1.0. The first is to generate incomplete test seeds using BMC and 
complete them using random inputs. The second is to automatically identify 
the cause if BMC fails, and re-run BMC with an appropriate remedial strategy. 
These enhancements, implemented in VeriFuzz 1.2 [8], are described below. 


1.1 Enhancement 1: New Seed Generation Approach


Instead of generating a complete test seed using BMC, which scales poorly,
for a given program P, we use CBMC [6] to generate an incomplete program
P_d by unwinding P only to a “short” depth d, which is heuristically guessed
to be small enough for BMC to scale. This short unwinding makes the rest of P
unreachable due to the incompleteness of the unwinding, but it allows BMC to
scale much better on P_d. We then use CBMC to produce test input sequences
that cover the branches of P_d using the CBMC option “--cover branches”. Each
of these inputs forms a valid prefix of a complete test input for P. We denote the
set of such prefixes by T_p. When P is executed with a prefix t_p in T_p, P may
still require additional inputs to complete its execution. We randomly generate
such additional inputs from the value ranges respecting the input types. We
append each of these random inputs to the corresponding t_p to form a complete
input for P. Our experimentation showed that this approach helped VeriFuzz
1.2 to cover many more branches, which could not be covered with VeriFuzz 1.0.
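
The completion of a BMC-generated prefix with random inputs can be sketched in Python. This is an illustration under stated assumptions, not VeriFuzz's implementation: the table of value ranges, the function name, and the idea that the types of the remaining inputs are known in advance are all hypothetical.

```python
import random

# Assumed value ranges per C input type (illustrative subset).
TYPE_RANGES = {
    "int": (-2**31, 2**31 - 1),
    "unsigned char": (0, 255),
    "_Bool": (0, 1),
}


def complete_input(prefix, remaining_types, rng):
    """Extend a BMC-generated prefix with random values of the required types."""
    suffix = [rng.randint(*TYPE_RANGES[t]) for t in remaining_types]
    return list(prefix) + suffix


rng = random.Random(0)  # fixed seed so the sketch is reproducible
prefix = [42, 7]        # hypothetical prefix t_p obtained for P_d
full = complete_input(prefix, ["int", "unsigned char"], rng)
print(full[:2], len(full))
```

The prefix is kept unchanged, so the part of the execution that BMC steered through the shallow branches is preserved, and only the inputs beyond the unwinding depth are random.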


1.2 Enhancement 2: Remedying A Stuck or Failed BMC 


We observed that often, even for short unwindings on complex programs,
BMC either gets stuck (i.e., does not terminate in the given time budget) or
fails with some error. We investigated this problem and found that BMC may
get stuck or fail in any of its internal phases during the translation of a given
program into a SAT/SMT formula. Sometimes the formula gets generated, but
the backend SAT/SMT solver times out due to the complexity of the formula.
Some of the common BMC failure causes and the remedial actions are:


1. Large number of unwindings due to loops or recursive calls: As each unwind- 
ing causes an exponential increase in the formula size, BMC may get stuck 
in unwinding the program even for an unwinding depth as small as 10. In 
such cases, we rerun BMC with an even smaller unwinding. If BMC still
gets stuck in unwinding, instead of trying to generate a formula for the
entire program, we try to generate one formula per path and solve each such
formula separately. CBMC supports this with the option “--paths”.

2. Large arrays: Large arrays require too many boolean variables (equal to 
the number of bits) to encode them into a SAT formula, which requires 
too much memory and solving time. In such cases, to use a SAT solver 
backend, we use the CBMC option “--arrays-uf-always” that translates arrays
as uninterpreted functions, thereby avoiding the bit blasting. Alternatively, 
depending on the program features, we use the Z3 SMT solver [9] as it 
supports array theory and hence does not require bit blasting of arrays. 

3. Quadratic constraints: To ensure functional consistency of translation of fea- 
tures such as array indexing operations, techniques like Ackermann expan- 
sion are used, which lead to a quadratic number of constraints. In some cases, 
this causes more than a billion constraints to be added to the SAT formula, 
and BMC gets stuck in adding these constraints. One remedy in such cases is 
to abstract out the array operations, for example by havocing the enclosing 
functions. This remedy is currently under investigation. 

4. Timeout trap: Sometimes, the SAT formula generation goes through, but 
the solver times out. We trap the timeout, and output the test inputs corre- 
sponding to the goals that have already been covered. 
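
The remedies listed above can be summarized as a small decision procedure. The
following Python sketch is illustrative only: the failure labels, the halving
of the unwinding depth, and the “lifo” strategy argument to “--paths” are
assumptions, not details taken from VeriFuzz. The CBMC options themselves
(“--unwind”, “--cover”, “--paths”, “--arrays-uf-always”, “--z3”) are the ones
named in the text or standard in CBMC.

```python
def remedy(failure, unwind, already_reduced=False, use_smt=False):
    """Return (new_unwind, extra_flags) for a stuck or failed BMC run."""
    if failure == "unwinding":
        if not already_reduced:
            # First try an even smaller unwinding (halving is an assumption).
            return max(1, unwind // 2), []
        # Still stuck: one formula per path ("lifo" is an assumed strategy).
        return unwind, ["--paths", "lifo"]
    if failure == "large_arrays":
        # Either avoid bit blasting in the SAT backend or switch to Z3.
        return unwind, ["--z3"] if use_smt else ["--arrays-uf-always"]
    # Quadratic constraints: the remedy is still under investigation.
    return unwind, []


def cbmc_command(source, unwind, extra_flags=()):
    """Build an argument list for a short-unwinding coverage run."""
    return ["cbmc", source, "--cover", "branches",
            "--unwind", str(unwind), *extra_flags]


new_unwind, flags = remedy("unwinding", 10)
command = cbmc_command("example.c", new_unwind, flags)
```

Keeping the choice of remedy separate from the construction of the command
line makes it easy to add a remedy for a newly observed failure cause.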


2 Tool Architecture and Flow


[Figure 1: architecture diagram. Recoverable labels: “New VeriFuzz
(Version 1.2)”, “Earlier VeriFuzz (Version 1.0)”, and the edge label
“Yes, re-run with a remedy”.]

Fig. 1. VeriFuzz architecture.


Figure 1 shows the architecture of VeriFuzz 1.2. The yellow boxes show the
enhancements to VeriFuzz 1.0. For BMC, we use CBMC 5.42.0, with Z3 4.8.12
and Glucose Syrup 4.0 [1]. VeriFuzz 1.2 takes two inputs: (1) a program to
test, say P, and (2) a property to test, such as branch or error coverage.
First, the module “BMC” invokes CBMC on P for a short unwinding, typically
with a timeout of 1 minute, to generate the incomplete test inputs. If CBMC
times out after generating some incomplete test inputs, the module outputs
those inputs. If it times out without generating any, the module identifies
the phase in which CBMC is stuck and re-runs CBMC with the corresponding
remedial strategy. Then, the “Test-Completion” module extends these
incomplete test inputs to form complete test inputs. These complete tests
are then passed to VeriFuzz 1.0, which fuzzes them using AFL to produce more
test inputs.
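
The control flow of the “BMC” and “Test-Completion” modules can be sketched as
follows. This is a simplified Python illustration, not the tool's code: the
callables `run_bmc`, `diagnose`, and `complete`, as well as the retry bound,
are hypothetical placeholders for the corresponding steps described above. The
final fuzzing stage with AFL is not modeled.

```python
def generate_seeds(run_bmc, diagnose, complete, max_retries=2):
    """Produce complete seeds, re-running BMC with a remedy when it is stuck.

    run_bmc(remedy) -> (timed_out, prefixes, log)
    diagnose(log)   -> name of the remedy to try next
    complete(p)     -> complete test input built from prefix p
    """
    remedy = None
    for _ in range(max_retries + 1):
        timed_out, prefixes, log = run_bmc(remedy)
        if prefixes or not timed_out:
            # Partial results are used even if BMC timed out.
            return [complete(p) for p in prefixes]
        # Timed out with nothing generated: pick a remedy and retry.
        remedy = diagnose(log)
    return []


# Usage with stub components: the first run is stuck, the second succeeds.
calls = []

def fake_bmc(remedy):
    calls.append(remedy)
    if remedy is None:
        return True, [], "stuck in unwinding"
    return False, [[1, 2]], ""

seeds = generate_seeds(fake_bmc,
                       diagnose=lambda log: "smaller-unwinding",
                       complete=lambda p: p + [0])
```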


3 Strengths and Weaknesses 


VeriFuzz 1.0 was enhanced with minor optimizations into VeriFuzz 1.1, which
contains neither Enhancement 1 nor Enhancement 2 (see Sec. 1). VeriFuzz 1.1
participated in Test-Comp 2020 [2], while VeriFuzz 1.2 participated in
Test-Comp 2021 [3] and Test-Comp 2022 [4]. Here, we compare the results of
VeriFuzz 1.2 against those of VeriFuzz 1.1 for all the categories common to
Test-Comp 2022 and 2020, except ECA (to avoid any bias due to the fixed
seeds).

Performance: In Cover-Error, VeriFuzz 1.2 detected 93% of the errors (387
out of 415) with an average time of 17 seconds, while VeriFuzz 1.1 detected
91% of the errors (262 out of 287) with an average time of 33 seconds. In
Cover-Branches, VeriFuzz 1.2 covered 68% of the branches (score 1626 out of
2378), with an average of 14.7 minutes per benchmark, while VeriFuzz 1.1
covered 59% of the branches (score 872 out of 1485), with an average of 13.3
minutes per benchmark. The higher time taken by VeriFuzz 1.2 directly
corresponds to the increase in coverage. On device drivers in BusyBox
(MemSafety) and LDV (ReachSafety), VeriFuzz 1.2 scored substantially better:
29 of 75 and 57 of 290 respectively, while VeriFuzz 1.1 scored only 19 of 72
and 23 of 290 respectively.
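
The percentages quoted above follow directly from the reported counts. The
short Python check below recomputes them; the dictionary keys are labels
chosen here for readability, while the numbers are those given in the text.

```python
# (achieved, total) pairs as reported in the comparison above.
results = {
    "Cover-Error 1.2": (387, 415),
    "Cover-Error 1.1": (262, 287),
    "Cover-Branches 1.2": (1626, 2378),
    "Cover-Branches 1.1": (872, 1485),
    "BusyBox 1.2": (29, 75),
    "BusyBox 1.1": (19, 72),
    "LDV 1.2": (57, 290),
    "LDV 1.1": (23, 290),
}

# Rounded percentage for each entry.
rates = {name: round(100 * got / total)
         for name, (got, total) in results.items()}

for name, pct in rates.items():
    print(f"{name}: {pct}%")
```

Since the totals differ between the two competition years, the rates, rather
than the raw scores, are the comparable quantities.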

Usefulness of the enhancements: We analyzed the results of VeriFuzz 1.2
and noticed that in some cases Enhancement 1 alone was sufficient, while in
other cases Enhancement 2 was also needed. For instance, in
loop-floats-scientific-comp/loop2-1.c and
ntdrivers-simplified/cdaudio_simpl1.cil-1.c, VeriFuzz 1.1 was unable to
detect the error, which VeriFuzz 1.2 could detect with the seeds generated
by Enhancement 1 alone. In cases like
array-examples/sorting_bubblesort_2_ground.i and array-tiling/mlceu.c,
Enhancement 2 was also required, in addition to Enhancement 1, to generate a
seed that allowed VeriFuzz 1.2 to detect the error, which VeriFuzz 1.1 could
not. In Cover-Branches, on benchmarks like loop-industry-pattern/mod3.c
(2 seeds generated by BMC) and bitvector/s3_srvr_la_alt.BV.c.cil.c (73 seeds
generated by BMC), VeriFuzz 1.2 could cover 90% of the branches, while
VeriFuzz 1.1 could cover none.

Weaknesses: (1) In some cases, e.g. array-multidimensional/copy-2-u.c, BMC
runs out of memory, which terminates the entire VeriFuzz process. In some
other cases, e.g. float-benchs/inv_square-1.c, mismatches in floating-point
interpretation between CBMC and VeriFuzz lead to unintended behavior. These
issues indicate that our tooling needs improvement. (2) In the Arrays
sub-category of Cover-Error, VeriFuzz 1.2 took more than twice the time of
VeriFuzz 1.1. This is because many array benchmarks contain for-loops that
iterate over large arrays. In such cases, short unwindings of BMC do not go
past the array initialization itself, and hence the seeds generated by BMC
were ineffective, adding to the elapsed time. (3) In BusyBox and LDV
drivers, there are many benchmarks that VeriFuzz is unable to solve due to
issues like complex loops and quadratic constraints, which we are currently
working on.


4 VeriFuzz Tool Configuration and Setup 


The tool is available at git@gitlab.com:sosy-lab/test-comp/archives-2022.git. To 
install and run the tool, follow the instructions in the README.txt provided 
with the tool. The benchexec tool-info module is verifuzz.py and the
benchmark description file is verifuzz.xml. A sample run command is as
follows:
./scripts/verifuzz.py --propertyFile coverage-error.prp example.c. In
2022, VeriFuzz 1.2 participated in the Cover-Branches and Cover-Error
categories.
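
For scripted runs, the sample command can be assembled programmatically. The
helper below is a hypothetical convenience wrapper, not part of VeriFuzz; it
only rebuilds the command line quoted above from its three components.

```python
import shlex


def verifuzz_command(source, property_file, script="./scripts/verifuzz.py"):
    """Return the argument list for one VeriFuzz run (hypothetical helper)."""
    return [script, "--propertyFile", property_file, source]


cmd = verifuzz_command("example.c", "coverage-error.prp")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` from the directory that
contains the unpacked tool archive.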


5 Software Project and Contributors 


VeriFuzz is developed and maintained by the authors at TCS Research. They 
can be contacted at VeriFuzz.Tool@tcs.com. We thank everyone who has con- 
tributed to VeriFuzz, AFL, PRISM, CBMC, Glucose Syrup and Z3. 